ABSTRACT



A method in which first and second representations of a 
document are provided, for example, by being made avail- 
able on one or more server computers connected to a 
computer network, such as the Internet or a corporate 
intranet. The first and second representations are resolution- 
dependent structured representations and have, respectively, 
first and second characteristic resolutions, the second reso- 
lution being greater than the first. The first representation, 
but not the second, is provided in digital form to an untrusted 
recipient. For example, the first representation can be trans- 
mitted through the network from the server on which the first 
representation is available to a client computer connected to 
the network. The second representation is converted to a 
third representation of the document, the third representation 
being a representation in a human-readable, nondigital form. 
For example, the second representation can be transmitted 
through the network in a secure manner to a trusted printing 
facility connected to the network. The trusted facility can 
then produce the third representation, for example by print- 
ing a hardcopy representation of the document. Finally, the 
third representation, but not the second, is provided to the 
untrusted recipient, for example, by physically transferring 
the third representation to the untrusted recipient. 

17 Claims, 20 Drawing Sheets 
USING FONTLESS STRUCTURED
DOCUMENT IMAGE REPRESENTATIONS
TO RENDER DISPLAYED AND PRINTED
DOCUMENTS AT PREFERRED
RESOLUTIONS

This application is a continuation-in-part of Ser. No. 
08/652,864 (now U.S. Pat. No. 5,884,084), filed May 23, 
1996, commonly assigned and having at least one common 
inventor.

BACKGROUND OF THE INVENTION 

The present invention relates to structured document 
representations and, more particularly, relates to structured 
document representations suitable for rendering into print-
able or displayable document raster images, such as bit-
mapped binary images or other binary pixel or raster images. 
The invention further relates to data compression techniques 
suitable for document image rendering and transmission. 
Structured Document Representations

Structured document representations provide digital rep- 
resentations for documents that are organized at a higher, 
more abstract level than merely an array of pixels. As a 
simple example, if this page of text is represented in the 
memory of a computer or in a persistent storage medium
such as a hard disk, CD-ROM, or the like as a bitmap, that 
is, as an array of 1s and 0s indicating black and white pixels,
such a representation is considered to be an unstructured 
representation of the page. In contrast, if the page of text is 
represented by an ordered set of numeric codes, each code
representing one character of text, such a representation is 
considered to have a modest degree of structure. If the page 
of text is represented by a set of expressions expressed in a 
page description language, so as to include information 
about the appropriate font for the text characters, the posi-
tions of the characters on the page, the sizes of the page
margins, and so forth, such a representation is a structured 
representation with a great deal of structure. 

Known structured document representation techniques 
pose a tradeoff between the speed with which a document
can be rendered and the expressiveness or subtlety with 
which it can be represented. This is shown schematically in 
FIG. 1 (PRIOR ART). As one looks from left to right along
the continuum 1 illustrated in FIG. 1, the expressiveness of the
representations increases, but the rendering speed decreases.
Thus, ASCII (American Standard Code for Information 
Interchange), a purely textual representation without format- 
ting information, renders quickly but lacks formatting infor- 
mation or other information about document structure, and 
is shown to the left of FIG. 1. Page description languages
(PDLs), such as PostScript® (Adobe Systems, Inc., Moun- 
tain View, Calif.; Internet: http://www.adobe.com) and Inter- 
press (Xerox Corporation, Stamford, Conn.; Internet: http:// 
www.xerox.com), include a great deal of information about 
document structure, but require significantly more time to
render than purely textual representations, and are shown to 
the right of continuum 1. 

Continuum 1 can be seen as one of document represen- 
tations having increasing degrees of document structure: 

At the left end of continuum 1 are purely textual
representations, such as ASCII. These convey only the 
characters of a textual document, with no information 
as to font, layout, or other page description 
information, much less any graphical, pictorial (e.g., 
photographic) or other information beyond text.

Also near the left end of continuum 1 is HTML 
(HyperText Markup Language), which is used to
represent documents for the Internet's World Wide
Web. HTML provides somewhat more flexibility than 
ASCII, in that it supports embedded graphics, images, 
audio and video recordings, and hypertext linking 
capabilities. However, HTML, too, lacks font and lay- 
out (i.e., actual document appearance) information. 
That is, an HTML document can be rendered 
(converted to a displayable or printable output) in 
different yet equally "correct" ways by different Web 
client ("browser") programs or different computers, or 
even by the same Web client program running on the 
same computer at different times. For example, in
many Web client programs, the line width of the rendered
HTML document varies with the dimensions of the 
display window that the user has selected. Increase the 
window size, and line width increases accordingly. The 
HTML document does not, and cannot, specify the line 
width. HTML, then, does allow markup of the struc-
ture of the document, but not markup of the layout of
the document. One can specify, for example, that a 
block of text is to be a first-level heading, but one
cannot specify exactly the font, justification, or other 
attributes with which that first-level heading will be 
rendered. (Information on HTML is available on the 
Internet from the World Wide Web Consortium at 
http://www.w3.org/pub/WWW/MarkUp/.)

At the right end of continuum 1 are page description
languages, such as PostScript and Interpress. These 
PDLs are full-featured programming languages that 
permit arbitrarily complex constructs for page layout, 
graphics, and other document attributes to be expressed 
in symbolic form.

In the middle of continuum 1 are printer control
languages, such as PCL5 (Hewlett-Packard, Palo Alto, 
Calif.; Internet: http://www.hp.com/), which includes 
primitives for curve and character drawing.

Also in the middle of continuum 1, but somewhat closer
to the PDLs, are cross-platform document exchange 
formats. These include Portable Document Format 
(Adobe Systems, Inc.) and Common Ground (Common 
Ground Software, Belmont, Calif.; Internet: http:// 
www.commonground.com/). Portable Document 
Format, or PDF, can be used in conjunction with a 
software program called Adobe Acrobat™. PDF 
includes a rich set of drawing and rendering operations 
invocable by any given primitive (available primitives 
include "draw," "fill," "clip," "text," etc.), but does not 
include programming language constructs that would, 
for example, allow the specification of compositions of 
primitives. 

Known structured document representation techniques 
assume that the rendering engine (e.g., display driver 
software, printer PDL decomposition software, or other 
software or hardware for generating a pixel image from the 
structured document representation) has access to a set of
character fonts. Thus a document represented in a PDL can, 
for example, have text that is to be printed in 12-point Times 
New Roman font with 18-point Arial Bold headers and 
footnotes in 10-point Courier. The rendering engine is 
presumed to have the requisite fonts already stored and 
available for use. That is, the document itself typically does 
not supply the font information. Therefore, if the rendering 
engine is called upon to render a document for which it does 
not have the necessary font or fonts available, the rendering 
engine will be unable to produce an authentic rendering of 
the document. For example, the rendering engine may 
substitute alternate fonts in lieu of those specified in the 
structured document representation, or, worse yet, may fail
to render anything at all for those passages of the document
for which fonts are unavailable.

The fundamental importance of fonts to PDLs is
illustrated, for example, by the extensive discussion of fonts
in the Adobe Systems, Inc. PostScript Language Reference
Manual (2d ed. 1990) (hereinafter PostScript Manual). At
page 266, the PostScript Manual says that a required entry
in all base fonts, encoding, is an "[a]rray of names that
maps character codes (integers) to character names--the
values in the array." Later, in Appendix E (pages 591-606),
the PostScript Manual gives several examples of fonts and
encoding vectors.

A notion basic to a font is that of labeling, or the semantic
significance given to a particular character or symbol. Each
character or symbol of a font has a unique associated
semantic label. Labeling makes font substitution possible:
Characters from different fonts having the same semantic
label can be substituted for one another. For example, each
of the characters 21, 22, 23, 24, 25, 26 in FIG. 2 (PRIOR
ART) has the same semantic significance: Each represents
the upper-case form of "E," the fifth letter of the alphabet
commonly used in English. However, each appears in a
different font. It is apparent from the example of FIG. 2 that
font substitution, even if performed for only a single
character, can dramatically alter the appearance of the ren-
dered image of a document.

A known printer that accepts as input a PDL document
description is shown schematically in FIG. 3 (PRIOR ART).
Printer 30 accepts a PDL description 35 that is interpreted,
or decomposed, by a rendering unit 31 to produce raster
images 32 of pages of the document. Raster images 32 are
then given to an image output terminal (IOT) 33, which
converts the images 32 to visible marks on paper sheets that
are output as printed output 36 for use by a human user.
Unfortunately, the speed at which the rendering unit 31 can
decompose the input PDL description cannot, in general,
match the speed at which the IOT 33 can mark sheets of
paper and dispense them as output 36. This is in part because
the result of decomposing the PDL description is indeter-
minate. As noted above, a PDL description such as PDL
description 35 does not correspond to a particular image or
set of images, but is susceptible of differing interpretations
and can be rendered in different ways. Thus rendering unit
31 becomes a bottleneck that limits the overall throughput of
printer 30.

Accordingly, a better structured document representation
technology is needed. In particular, what is needed is a way
to eliminate the tradeoff between expressiveness and ren-
dering speed and, moreover, a way to escape the tyranny of
font dependence.

Data Compression for Document Images

Data compression techniques convert large data sets, such
as arrays of data for pixel images of documents, into more
compact representations from which the original large data
sets can be either perfectly or imperfectly recovered. When
the recovery is perfect, the compression technique is called
lossless; when the recovery is imperfect, the compression
technique is called lossy. That is, lossless compression
means that no information about the original document
image is irretrievably lost in the compression/
decompression cycle. With lossy compression, information
is irretrievably lost during compression.

Preferably, a data compression technique affords fast,
inexpensive decompression and provides faithful rendering
together with a high compression ratio, so that compressed
data can be stored in a small amount of memory or storage
and can be transmitted in a reasonable amount of time even
when transmission bandwidth is limited.

Lossless compression techniques are often to be preferred
when compressing digital images that originate as structured
document representations produced by computer programs.
Examples include the printed or displayed outputs of word
processing programs, page layout programs, drawing and
painting programs, slide presentation programs, spreadsheet
programs, Web client programs, and any number of other
kinds of commonly used computer software programs. Such
outputs can be, for example, document images rendered
from PDL (e.g., PostScript) or document exchange format
(e.g., PDF or Common Ground) representations. In short,
these outputs are images that are generated in the first
instance from symbolic representations, rather than origi-
nating as optically scanned versions of physical documents.

Lossy compression techniques can be appropriate for
images that do originate as optically scanned versions of
physical documents. Such images are inherently imperfect
reproductions of the original documents they represent. This
is because of the limitations of the scanning process (e.g.,
noise, finite resolution, misalignment, skew, distortion, etc.).
Inasmuch as the images themselves are of limited fidelity to
the original, an additional loss of fidelity through a lossy
compression scheme can be acceptable in many circum-
stances.

Known encoding techniques that are suitable for lossless
image compression include, for example, CCITT Group-4
encoding, which is widely used for facsimile (fax)
transmissions, and JBIG encoding, a binary image compres-
sion standard promulgated jointly by the CCITT and the
ISO. (CCITT is a French acronym for Comité Consultatif
International de Telegraphique et Telephonique. ISO is the
International Standards Organization. JBIG stands for Joint
Bilevel Image Experts Group.) Known encoding techniques
that are suitable for lossy image compression include, for
example, JPEG (Joint Photographic Experts Group)
encoding, which is widely used for compressing gray-scale
and color photographic images, and symbol-based compres-
sion techniques, such as that disclosed in U.S. Pat. No.
5,303,313, "METHOD AND APPARATUS FOR COM-
PRESSION OF IMAGES" (issued to Mark et al. and origi-
nally assigned to Cartesian Products, Inc. (Swampscott,
Mass.)), which can be used for images of documents con-
taining text characters and other symbols.

As compared with lossy techniques, lossless compression
techniques of course provide greater fidelity, but also have
certain disadvantages. In particular, they provide lower
compression ratios, slower decompression speed, and other
performance characteristics that can be inadequate for cer-
tain applications, as for example when the amount of
uncompressed data is great and the transmission bandwidth
from the server or other data source to the end user is low.
It would be desirable to have a compression technique with
the speed and compression ratio advantages of lossy
compression, yet with the fidelity and authenticity that is
afforded only by lossless compression.

SUMMARY OF THE INVENTION

The present invention provides a structured document
representation that is at once highly expressive and fast and
inexpensive to render. According to the invention, symbol-
based token matching, a compression scheme that has hith-
erto been used only for lossy image compression, is used to
achieve lossless compression of original document images
produced from PDL representations or other structured
document representations. A document containing text and
graphics is compiled from its original structured represen-
tation into a token-based representation (which is itself a
structured document representation), and the token-based 
representation, in turn, is used to produce a rendered pixel 
image. The token-based representation can achieve high 
compression ratios, and can be quickly and faithfully ren- 
dered without reference to a set of fonts. 
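
The idea of rendering from tokens and positions, with no font lookup, can be illustrated with a minimal Python sketch. The names `render`, `tokens`, and `positions`, the list-of-rows bitmap type, and the sample 3x3 token are assumptions made for illustration only; they are not the tokenized file format of the preferred embodiment.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels (assumed representation)


def render(tokens: Dict[int, Bitmap],
           positions: List[Tuple[int, int, int]],
           width: int, height: int) -> Bitmap:
    """Paint each token's pixels at every (token_id, x, y) position.

    No font is consulted: the token itself carries the pixel data.
    """
    page = [[0] * width for _ in range(height)]
    for token_id, x, y in positions:
        for dy, row in enumerate(tokens[token_id]):
            for dx, pixel in enumerate(row):
                if pixel:
                    page[y + dy][x + dx] = 1
    return page


# Hypothetical 3x3 token (a plus sign) that occurs twice on a 9x3 page.
tokens = {0: [[0, 1, 0], [1, 1, 1], [0, 1, 0]]}
positions = [(0, 0, 0), (0, 5, 0)]
page = render(tokens, positions, width=9, height=3)
```

Because the one token is stored once and referenced twice, the stored data grows with the number of distinct shapes rather than with the number of marks on the page.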

In one aspect of the invention, a processor is provided 
with a first set of digital information including a first 
structured representation of a document. This first represen- 
tation is a resolution-independent representation. A plurality 
of image collections are obtainable from the first represen- 
tation. Each such obtainable image collection includes at 
least one image. Each image in each such collection is an 
image of at least a portion of the document, and has a 
characteristic resolution. With a processor, from the first set 
of digital information a second set of digital information is 
produced. The second set includes a second, relatively 
low-resolution structured representation of the document. 
The low-resolution representation is a lossless representa- 
tion of a low-resolution image collection that is one of the 
plurality of image collections obtainable from the first 
representation. Each image in the low-resolution image 
collection has a relatively low characteristic resolution. The 
low-resolution representation includes a plurality of tokens 
and a plurality of positions. The second set of digital 
information is produced by extracting the low-resolution 
tokens from the first representation, each low-resolution 
token including a set of pixel data representing a subimage 
of the low-resolution image collection, and determining 
from the first representation the plurality of positions of the 
low-resolution representation, each such position being a 
position in the low-resolution image collection of a subim- 
age from one of the low-resolution tokens. At least one such 
low-resolution subimage has a plurality of pixels and occurs 
at more than one position in the image collection. With a 
processor, from the first set of digital information a third set 
of digital information is produced. The third set includes a 
third, relatively high-resolution structured representation of 
the document. The high-resolution representation is a loss- 
less representation of a high-resolution image collection that 
is one of the plurality of image collections obtainable from 
the first representation. Each image in the high-resolution 
image collection has a relatively high characteristic resolu- 
tion that is greater than the characteristic resolution of the 
low-resolution image collection (e.g., sets of page images).
The high-resolution representation includes a plurality of 
tokens and a plurality of positions. The third set of digital 
information is produced by extracting the high-resolution 
tokens from the first representation, each high-resolution 
token comprising a set of pixel data representing a subimage 
of the high-resolution image collection, and determining
from the first representation the plurality of positions of the 
high-resolution representation, each such position being a 
position in the high-resolution image collection of a subim-
age from one of the high-resolution tokens. At least one
such high-resolution subimage has a plurality of pixels
and occurs at more than one position in the image collection.
The second and third sets of digital information thus pro- 
duced are then made available for further use. 
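
The production of separate low-resolution and high-resolution tokenized representations can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the method of the preferred embodiment: fixed-size cells stand in for extracted subimages, `upscale` is a hypothetical stand-in for rendering the same document at a higher resolution, and all function names are invented for the example.

```python
from typing import Dict, List, Tuple

Page = List[List[int]]                 # rows of 0/1 pixels (assumed)
Token = Tuple[Tuple[int, ...], ...]    # immutable subimage
Position = Tuple[int, int, int]        # (token id, x, y)


def upscale(page: Page, k: int) -> Page:
    """Hypothetical stand-in for rendering at k times the resolution."""
    return [[p for p in row for _ in range(k)] for row in page for _ in range(k)]


def tokenize(page: Page, cell: int) -> Tuple[List[Token], List[Position]]:
    """Collect distinct non-blank cells as tokens and record where they occur."""
    tokens: List[Token] = []
    ids: Dict[Token, int] = {}
    positions: List[Position] = []
    for y in range(0, len(page), cell):
        for x in range(0, len(page[0]), cell):
            sub = tuple(tuple(row[x:x + cell]) for row in page[y:y + cell])
            if not any(any(row) for row in sub):
                continue  # blank cells need no token
            if sub not in ids:
                ids[sub] = len(tokens)
                tokens.append(sub)
            positions.append((ids[sub], x, y))
    return tokens, positions


def reconstruct(tokens: List[Token], positions: List[Position],
                width: int, height: int) -> Page:
    """Rebuild the page image exactly from tokens and positions."""
    page = [[0] * width for _ in range(height)]
    for token_id, x, y in positions:
        for dy, row in enumerate(tokens[token_id]):
            page[y + dy][x:x + len(row)] = list(row)
    return page


low_page = [[1, 0, 0, 0, 1, 0],
            [1, 1, 0, 0, 1, 1]]
high_page = upscale(low_page, 2)

low_tokens, low_positions = tokenize(low_page, cell=2)
high_tokens, high_positions = tokenize(high_page, cell=4)
```

In this toy case each representation holds a single multi-pixel token that occurs at two positions, and each reconstructs its own page image exactly, which is the sense in which both representations are lossless at their respective resolutions.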

In another aspect of the invention, first and second rep- 
resentations of a document are provided, for example, by 
being made available on one or more server computers 
connected to a computer network, such as the Internet or a 
corporate intranet. The first and second representations are 
resolution-dependent structured representations and have, 
respectively, first and second characteristic resolutions, the 
second resolution being greater than the first. The first
representation, but not the second, is provided in digital
form to an untrusted recipient. For example, the first repre-
sentation can be transmitted through the network from the
server on which the first representation is available to a
client computer connected to the network. The second
representation is converted to a third representation of the
document, the third representation being a representation in
a human-readable, nondigital form. For example, the second
representation can be transmitted through the network in a
secure manner to a trusted printing facility connected to the
network. The trusted facility can then produce the third
representation, for example by printing a hardcopy repre-
sentation of the document. Finally, the third representation,
but not the second, is provided to the untrusted recipient, for
example, by physically transferring the third representation
to the untrusted recipient.
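
The distribution rule in this aspect can be summarized in a small Python sketch. The function name `digital_copy_for`, the party names, and the use of a dictionary keyed by resolution are assumptions for illustration; the sketch covers only which digital file each party may receive, not the secure transmission or the printing itself.

```python
from typing import Dict, Set


def digital_copy_for(party: str, by_dpi: Dict[int, bytes],
                     trusted: Set[str]) -> bytes:
    """Return the high-resolution file only to trusted parties.

    Untrusted recipients receive the lowest-resolution representation
    in digital form; they obtain the high-resolution content only as
    hardcopy produced by a trusted facility.
    """
    dpi = max(by_dpi) if party in trusted else min(by_dpi)
    return by_dpi[dpi]


# Hypothetical representations and parties.
representations = {100: b"low-res tokenized file", 600: b"high-res tokenized file"}
trusted_parties = {"print-shop"}

for_reader = digital_copy_for("reader", representations, trusted_parties)
for_printer = digital_copy_for("print-shop", representations, trusted_parties)
```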

The invention will be better understood with reference to
the drawings and detailed description below. In the
drawings, like reference numerals indicate like components.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 schematically illustrates the tradeoff between
expressiveness and rendering speed in structured docu-
ment representations of the PRIOR ART;

FIG. 2 depicts examples of the letter "E" in different fonts
of the PRIOR ART;

FIG. 3 schematically illustrates a printer for printing a
document from an input page description language file in the
PRIOR ART;

FIG. 4 shows the overall sequence of transformations
applied to a structured document representation in a com-
plete compression-decompression cycle according to the
invention;

FIG. 5 schematically illustrates a compressor for convert-
ing an input page description language file into a tokenized
representation, showing in more detail the transformations
applied to a structured document representation in the com-
pression phase of FIG. 4;

FIG. 6 is a series of views showing how the compression
and decompression phases can be decoupled from one
another;

FIG. 7 schematically illustrates a printer for printing a
document from a tokenized representation;

FIG. 8 schematically illustrates a display viewer for
displaying a document from a tokenized representation;

FIG. 9 shows hardware and software components of a
system suitable for converting a structured representation of
a document into a tokenized representation of the document;

FIG. 10 shows a system including components suitable
for converting a tokenized representation of a document into
rendered images, such as printable or displayable page
images;

FIG. 11 illustrates the tokens and positions in an
exemplary, highly simplified tokenized file format;

FIG. 12 is a diagram of the encapsulation of dictionary
blocks and pages (including position blocks and residual
blocks) for a document represented in an exemplary,
simplified, noninterleaved tokenized file format;

FIG. 13 is a diagram of the encapsulation of dictionary
blocks and pages (including position blocks and residual
blocks) for a document represented in an exemplary,
simplified, interleaved tokenized file format;

FIG. 14 is a flowchart of the steps in document compres-
sion;
FIG. 15 is a flowchart of the steps in document decom-
pression;

FIGS. 16-23 show the tokenized file format in a preferred
embodiment, wherein

FIG. 16 shows the format of a dictionary block, including
dictionary extensions,

FIG. 17 shows the format of a height class,

FIG. 18 shows the format of a dictionary clearing section,

FIG. 19 shows the format of a position block, including
position extensions,

FIG. 20 shows the format of a strip,

FIG. 21 shows the format of a residual block,

FIG. 22 shows the encapsulation of dictionary blocks and
pages for a document represented in the tokenized file
format of the preferred embodiment, and

FIG. 23 shows the position blocks, residual blocks, and
other elements of a page of a document in the tokenized file
format of the preferred embodiment;

FIG. 24 is a flowchart showing the operation of a World
Wide Web viewer incorporating Web pages that have been
compressed as tokenized files;

FIG. 25 illustrates a conceptual example of browse-now-
print-later Web access as shown from the Web user's per-
spective;

FIG. 26 illustrates the encoding phase of the invention for
browse-now-print-later applications;

FIG. 27 illustrates a simple example of an embodiment of
the decoding phase of the invention for browse-now-print-later
applications; and

FIG. 28 illustrates a more complex example of an embodi-
ment of the invention for browse-now-print-later applica-
tions.

DETAILED DESCRIPTION

Overview

According to the invention in a specific embodiment, a
richly expressive structured document representation, such
as a PostScript or other PDL representation, or PDF or other
document exchange language representation, is compiled or
otherwise converted into a tokenized file format, such as the
DigiPaper format that will be described more fully below.
The tokenized representation, in turn, can rapidly be ren-
dered into an unstructured representation of the document
image, such as a bitmap or a CCITT Group-4 compressed
bitmap, that can be printed, displayed, stored, transmitted,
etc.

The PDL or other initial representation of the document is
capable of being rendered into page images in different
ways, such as with different display or print resolutions or
with different font substitutions. For example, a given Post-
Script file can be printed on two different printers of different
resolutions, e.g., a 300 dpi (dots per inch) printer and a 600
dpi printer, and the PostScript interpreter for each printer
will automatically rescale to compensate for the different
resolutions. As another example, a given PostScript file can
be rendered differently by two different printers if the two
printers perform different font substitutions. For all its rich
expressiveness, then, a PDL representation of a document
does not uniquely specify an image of the document to be
output on the printer or display screen.

In contrast, in a preferred embodiment the tokenized
representation is specific to a particular rendering of the
document, that is, a particular page image or set of page
images at a particular resolution. Also, the tokenized repre-
sentation has no notion of font, and does not rely on fonts in
order to be converted into printable or displayable form.

Thus, in a preferred embodiment, the inventive method
contemplates automatic conversion by a computer or other
processor of an initial, resolution-independent, structured
document description, one that does not define a unique
visual appearance of the document, into a resolution-
dependent structured document description that does define
a unique visual appearance of the document. This image-
based, resolution-dependent description guarantees fidelity:
Whereas a set of page images must be generated anew each
time a PDL document is rendered for display, print, or other
human-readable media, with the DigiPaper
representation, a set of page images is generated once, up
front, and then is efficiently and losslessly represented in a
structured format that can be stored, distributed, and so
forth. DigiPaper maintains the expressiveness of the original
PDL representation, without being subject to the unpredict-
ability of rendering that is inherent in a non-image-based
representation. Moreover, a DigiPaper representation of a
document can be converted into final output form more
quickly and with less computational overhead than its PDL
counterpart.

Although the DigiPaper tokenized representation is
image-based, it is nevertheless a structured document rep-
resentation; it is not merely a sequence of bits, bytes, or
run-lengths. In this respect, DigiPaper differs from a raster
(e.g., bitmap) image, a CCITT-4 compressed image, or the
like. Moreover, in contrast with unstructured
representations, DigiPaper achieves better image compres-
sion ratios. For example, DigiPaper typically achieves 2 to
[illegible in source] greater compression than can be achieved using a
TIFF file format with CCITT Group-4 compressed image
data, and offers a compression ratio with respect to the raw,
uncompressed image data of as much as 100 to 1. (TIFF,
an abbreviation for Tagged Image File Format, is a trade-
mark formerly registered to Aldus Corp. of Seattle, Wash.,
and is now claimed by Adobe Systems, Inc., Mountain View,
Calif., with whom Aldus has since merged). Indeed, a
DigiPaper file can be approximately the same size as the
PDL file from which it is produced.

Because DigiPaper offers rapid, predictable rendering,
guaranteed fidelity, and good data compression, it is well
suited for a wide variety of printing and display applications.
Thus the method for converting a document from a PDL or
other structured document representation into a DigiPaper
tokenized representation according to the invention is a
method of wide utility.

As one example, the invention can be used to improve the
throughput of a printer, such as a laser printer, ink-jet printer,
or the like, by eliminating the rendering speed bottleneck
inherent in PDL printers of the prior art (see discussion of
printer 30 in connection with FIG. 3, above). The bottleneck
can be eliminated because DigiPaper files can be decoded
quickly, at predictable speeds. Speeds of about 5 pages per
second have been achieved on a Sun SPARC-20 workstation
using 600 dpi images.

Other examples of use of the invention will be described
later on.

Compression-Decompression Cycle

FIG. 4 illustrates the overall sequence of transformations
applied to a structured representation of a document in a
complete compression-decompression cycle according to
the invention in the specific embodiment. The document to
be transformed is assumed to be one that can be rendered as
a set of one or more binary images, such as a document
containing black-and-white text and graphics. A PDL rep-
resentation 40 of the document, such as a PostScript file, is
input to a tokenizing compiler 41, which produces a token-
ized representation 42 of the document. The tokenized
representation 42, in turn, is input to a rendering engine 43
that produces an output binary image 44.

Tokenizing compiler 41 is also called a compressor, and
tokenized representation 42 is also called a compressed
representation. Tokenized representation 42 is compressed
in the sense that it is smaller than the output bitmap 44.
(Tokenized representation 42 can be comparable in size to
PDL representation 40.) The production of a tokenized
document representation from an input PDL document rep-
resentation (e.g., the production of tokenized representation
42 from input PDL representation 40) is thus called the
compression phase of the transformation sequence, and the
production of an output image from the tokenized represen-
tation (e.g., the production of output binary image 44 from
tokenized representation 42) is called the decompression
phase of the sequence.

FIG. 5 again shows PDL representation 40 being input to
tokenizing compiler 41 and tokenized representation 42
being produced thereby. Here, tokenizing compiler 41 is
illustrated in greater detail. In this embodiment, tokenizing
compiler 41 begins by processing input PDL representation
40 through a PDL decomposer 45 to produce one or more
page images 46. PDL decomposer 45 is of the kind ordi-
narily used to turn PDL files into output images in known
printers and displays; for example, for a PostScript input file
40, PDL decomposer 45 can be implemented as a PostScript

remotely to the processor that performs the tokenization. As
another example, tokenized representation 62 can be trans-
mitted from wherever it is generated to another location. In
particular, tokenized representation 62 can be generated by
a computer and transmitted across a local-area or wide-area
computer network to another computer, such as a print
server or file server, or to a hardcopy output device, such as
a printer or a multifunction device. In still another example,
tokenized representation 62 can be replicated and dissemi-
nated. For example, tokenized representation 62 can be
transmitted across a computer network, such as the Internet,
to a server computer, and cached there; thereafter, copies of
tokenized representation 62 can be called up from the server
cache by remote clients.

In view (b) of FIG. 6, the decompression phase takes
place. Tokenized representation 65 is obtained at 64 by a
device that will perform the decompression and output. For
example, tokenized representation 65 can be retrieved from
storage, received across a computer network or by telephone
(modem), or copied from another tokenized representation.
Tokenized representation 65 is input to a rendering engine
66, which outputs the document as a page image or set of
page images that are or can be displayed, printed, faxed,
transmitted by computer network, etc.

In this example, although tokenized representation 65 of
the decompression phase (b) can be identified with token-
ized representation 62 of the compression phase (a), it need

interpreter program executed by a processor. The page not be so identified. Tokenized representation 65 can also be, 

images 46 are bitmaps, or compressed bitmaps, that repre- for example, one of any number of copies of tokenized 

sent the pages of the document. In a conventional printer or 30 representation 62 made and distributed ahead of time. As 

visual display, the bitmaps 46 would be output to drive, another example, tokenized representation 65 can be a 

respectively, the IOT or display monitor. Here, however, representation of some document other than the one used to 

according to the invention, page images 46 are compressed produce tokenized representation 62. In any event, tokenized 

by a tokenizer or compressor 47. Compressor 47 takes the representation 65 is preferably a representation that has been 

page images and constructs a DigiPaper or other tokenized 35 created (i.e., compressed) from an image or set of images 

data stream or file, which compressor 47 can then store, whose resolution matches the output resolution of rendering 

transmit, or otherwise make available for further processing. engine 66. 

Thus, the output of compressor 47 is tokenized representa- Further examples of how a tokenized representation can 

tion 42. be saved for later use (as at 63) and then obtained for use (as 

Compressor 47 can be implemented as a software pro- 40 at 64) are described below with reference to FIGS. 9-10 and 

gram executed by a processor. the accompanying text. 

The steps by which compressor 47 can perform the Certain advantages obtain by decoupling the compression 

tokenization (compression) in this embodiment are and decompression phases as illustrated in FIG. 6. In 

described below with reference to FIG. 14 and the accom- particular, for printing applications, the computationally 

panying text. The DigiPaper file format, which is the pre- 45 expensive and unpredictably long task of decomposing PDL 

f erred form for tokenized representation 42 in this can be done ahead of time (e.g., off-line by a dedicated 

embodiment, and thus the preferred form for the output of server). Then the printer need only decompress the DigiPa- 

compressor 47, is described in detail below with reference to per tokenized format, which can be done quickly and 

FIGS. 16-23 and the accompanying text in numbered sec- efficiently and at predictable speeds. Accordingly, the printer 

tions 1 through 8. 50 can be made faster and, at the same time, less expensive, 

Also shown in FIG. 5 is an alternative way of producing since its computing hardware can be less powerful than what 

tokenized representation 42. According to this alternative, is required for a conventional PDL printer, 

tokenizing compiler 41 is designed so that PDL decomposer Some examples of rendering engines suitable for use as 

45 is not a standard PDL decomposer, but instead is closely rendering engine 66 are shown in FIGS. 7-8. FIG. 7 

coupled to compressor 47, so that no intermediate page 55 schematically illustrates a printer 76 that can print a docu- 

images 46 are produced. This alternative can be called direct ment from a tokenized representation, such as a DigiPaper 

compilation of input PDL description 40 into tokenized file. Printer 76 is an example of the bottleneck-free printer 

representation 42. It is illustrated by arrow 49. mentioned earlier. It is designed to accept an input tokenized 

The series of two views in FIG. 6 shows that the com- representation, such as tokenized representation 75, and 
pression and decompression phases of the transformation 60 convert that representation to printed output. It need not 
sequence of FIG. 4 can be decoupled from one another. In have an on-board PDL decomposer, and its on-board corn- 
view (a), the compression phase takes place. A PDL docu- puting power can accordingly be quite modest. Printer 76 
ment description 60 is input to a tokenizing compiler 61 to works by decompressing input tokenized representation 75 
produce a tokenized representation 62. The tokenized rep- with a decompressor 71. Decompressor 71 can be, for 
resentation 62 is then saved for later use at 63, For example, 65 example, an on-board processor executing decompression 
tokenized representation 62 can be stored in a file on a hard software. Alternatively, it can be implemented in dedicated 
disk or other persistent storage medium, either locally or hardware. Decompressor 71 produces a set of one or more 
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raster images 72, one for each page of the printed document. 
The raster images are provided to a conventional IOT 73, 
which produces printed output 77. 

FIG. 8 schematically illustrates a visual display 86 that 
can display a document given an input tokenized
representation, such as a DigiPaper file. It is similar in
concept to printer 76. Display 86 accepts an input tokenized
representation, such as tokenized representation 85, and
decompresses it with a decompressor 81. Decompressor 81
produces a set of one or more raster images 82, one for each
page of the printed document. The raster images can be
produced all at once, or on an as-needed basis, according to
the available display memory and other constraints on the
environment in which display 86 operates. The raster images
are provided to a display terminal 83, such as a cathode-ray
tube (CRT) or flat-panel monitor screen, which produces 
output that can be read by a human being. 
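
The choice between producing the raster images all at once and producing them on an as-needed basis can be sketched in Python. This is only an illustration under stated assumptions, not the implementation of decompressor 81: the names render_all, render_on_demand, and toy_decompress are hypothetical, and toy_decompress merely stands in for a real decompressor by returning a blank raster of the requested size.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Raster = List[List[int]]          # one page image as rows of 0/1 pixels
PageData = Tuple[int, int]        # toy "compressed page": (width, height) only


def render_all(pages: Iterable[PageData],
               decompress: Callable[[PageData], Raster]) -> List[Raster]:
    """Eager strategy: decompress every page up front."""
    return [decompress(page) for page in pages]


def render_on_demand(pages: Iterable[PageData],
                     decompress: Callable[[PageData], Raster]) -> Iterator[Raster]:
    """Lazy strategy: decompress a page only when the display requests it."""
    for page in pages:
        yield decompress(page)


calls: List[PageData] = []        # records which pages were actually decompressed


def toy_decompress(page: PageData) -> Raster:
    """Hypothetical stand-in for a decompressor: returns a blank raster."""
    calls.append(page)
    width, height = page
    return [[0] * width for _ in range(height)]


compressed = [(4, 2), (4, 2), (4, 2)]
viewer = render_on_demand(compressed, toy_decompress)
first_page = next(viewer)         # only the first page has been decompressed so far
calls_after_first = len(calls)
all_pages = render_all(compressed, toy_decompress)
```

The lazy variant holds at most one decompressed page at a time, which suits a display with limited memory; the eager variant trades memory for immediate access to every page.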

Like printer 76, display 86 need not have an on-board 
PDL decomposer. Thus, for example, if display 86 is 
included as part of a personal computer or other
general-purpose computer, the processor (CPU) of the computer
need not expend much computing power in order to keep 
display 86 supplied with pixels. This can be advantageous, 
for example, when display 86 is rendering documents 
received from afar, such as World Wide Web pages.

Although the rendering engine examples 76, 86 shown in 
FIGS. 7-8 produce output images that are immediately 
visible as printed or displayed pages, other rendering 
engines can produce other kinds of image output. In
particular, the output from a rendering engine suitable for
use as rendering engine 66 can be an encoded bitmap (e.g., 
a CCITT Group-4 transmission to be received by a remote 
fax or multifunction device) or other unstructured document 
format. 

The steps by which decompressors, such as decompressor
71 and decompressor 81, can perform the decompression in 
this embodiment are described below with reference to FIG. 
15 and the accompanying text. 
System Components 

FIG. 9 shows hardware and software components of an
exemplary system suitable for performing the compression
phase of the transformation sequence of FIG. 4. The system
of FIG. 9 includes a general-purpose computer 100
connected by one or more communication pathways, such as
connection 129, to a local-area network (LAN) 140 and also
to a wide-area network, here illustrated as the Internet 180. 
Through LAN 140, computer 100 can communicate with 
other local computers, such as a file server 141. Through the 
Internet 180, computer 100 can communicate with other 
computers, both local and remote, such as World Wide Web
server 181. As will be appreciated, the connection from
computer 100 to Internet 180 can be made in various ways,
e.g., directly via connection 129, or through local-area
network 140, or by modem (not shown). 

Computer 100 is a personal or office computer that can be,
for example, a workstation, personal computer, or other
single-user or multi-user computer system; an exemplary
embodiment uses a Sun SPARC-20 workstation (Sun
Microsystems, Inc., Mountain View, Calif.). For purposes of
exposition, computer 100 can be conveniently divided into
hardware components 101 and software components 102;
however, persons of skill in the art will appreciate that this
division is conceptual and somewhat arbitrary, and that the
line between hardware and software is not a hard and fast
one. Further, it will be appreciated that the line between a
host computer and its attached peripherals is not a hard and
fast one, and that in particular, components that are
considered peripherals of some computers are considered integral
parts of other computers. Thus, for example, user I/O 120 
can include a keyboard, a mouse, and a display monitor, 
each of which can be considered either a peripheral device 
or part of the computer itself, and can further include a local 
printer, which is typically considered to be a peripheral. As 
another example, persistent storage 108 can include a 
CD-ROM (compact disc read-only memory) unit, which can 
be either peripheral or built into the computer. 

Hardware components 101 include a processor (CPU) 
105, memory 106, persistent storage 108, user I/O 120, and 
network interface 125. These components are well under- 
stood by those of skill in the art and, accordingly, need be 
explained only briefly here. 

Processor 105 can be, for example, a microprocessor or a 
collection of microprocessors configured for multiprocess- 
ing. It will be appreciated that the role of computer 100 can 
be taken in some embodiments by multiple computers acting 
together (distributed computation); in such embodiments, 
the functionality of computer 100 in the system of FIG. 9 is 
taken on by the combination of these computers, and the 
processing capabilities of processor 105 are provided by the 
combined processors of the multiple computers. 

Memory 106 can include read-only memory (ROM), 
random-access memory (RAM), virtual memory, or other 
memory technologies, singly or in combination. Persistent 
storage 108 can include, for example, a magnetic hard disk, 
a floppy disk, or other persistent read-write data storage 
technologies, singly or in combination. It can further include 
mass or archival storage, such as can be provided by 
CD-ROM or other large-capacity storage technology. (Note 
that file server 141 provides additional storage capability 
that processor 105 can use.) 

User I/O (input/output) hardware 120 typically includes a 
visual display monitor such as a CRT or flat-panel display,
an alphanumeric keyboard, and a mouse or other pointing 
device, and optionally can further include a printer, an 
optical scanner, or other devices for user input and output. 

Network I/O hardware 125 provides an interface between 
computer 100 and the outside world. More specifically, 
network I/O 125 lets processor 105 communicate via con- 
nection 129 with other processors and devices through LAN 
140 and through the Internet 180. 

Software components 102 include an operating system 
150 and a set of tasks under control of operating system 150, 
such as an application program 160 and, importantly, token- 
izing compiler software 165. Operating system 150 also 
allows processor 105 to control various devices such as 
persistent storage 108, user I/O 120, and network interface 
125. Processor 105 executes the software of operating 
system 150 and its tasks 160, 165 in conjunction with 
memory 106 and other components of computer system 100. 

Software components 102 provide computer 100 with the 
capability of serving as a tokenizing compiler according to 
the invention. This capability can be divided up among 
operating system 150 and its tasks as may be appropriate to 
the particular circumstances. 

In FIG. 9, the tokenizing capability is provided primarily 
by task 165, which carries out a tokenizing compilation of 
an input PDL document according to the steps described 
below with reference to FIG. 14 and the accompanying text. 
The input PDL document can be provided from any number 
of sources. In particular, it can be generated as output by 
application program 160, retrieved from persistent storage 
108 or file server 141, or downloaded from the Internet 180, 
e.g., from Web server 181. 

FIG. 10 shows a system in which the decompression 
phase of the transformation sequence of FIG. 4 can be
performed in a variety of ways. The exemplary system of
FIG. 10 is illustrated as a superset of the system of FIG. 9;
in particular, it includes computer 100, file server 141, web
server 181, LAN 140 and the Internet 180. Further, the
system of FIG. 10 adds various system components 200 that
can be used to render tokenized representations of
documents. Components 200 include a second general purpose
computer 210, a network printer 220, a print server 230, and
a "smart" multifunction device 240.

In operation of the system of FIG. 10, a document that has
previously been converted from a PDL representation to a
tokenized representation (e.g., a document produced by
tokenizing compiler 165 in computer 100; a document from
file server 141 or Web server 181) is made available via a
network connection 229 to one or more of components 210,
220, 230, 240. Each of these components can serve as a
rendering engine and, in particular, as a decompressor. Each
is assumed to include communications software enabling the
processor to obtain a tokenized representation of a
document, and decompression software enabling the
processor to turn that tokenized representation into image data
suitable for a particular form of output. The decompression
software can be resident in the component, or can be
downloaded along with the tokenized representation from
LAN 140 or the Internet 180 via connection 229.

Computer 210 can be a general-purpose computer with
characteristics and hardware components similar to those of
computer 100; an exemplary embodiment uses a Sun
SPARC-20 workstation. Also like computer 100, computer
210 has software that includes an operating system
controlling one or more tasks. However, whereas computer 100 has
compression software, computer 210 has decompression
software. That is, the software of computer 210 includes
software that itself renders the processor of computer 210
capable of decompressing the tokenized representation, or
else includes network client software that the processor can
execute to download the decompression software, which in
turn can be executed to decompress the tokenized
representation. (Note that a computer can, of course, have both
compression and decompression software loaded into its
memory, and that in some cases, a single computer can act
as both compression computer 100 and decompression
computer 210.)

Computer 210 is shown connected to a display monitor
211, a local printer 212, a modem 213, a persistent storage
device 214, and network output hardware 215. Computer
210 can control these devices and, in particular, can run
decompression software appropriate for each of them.

For example, by executing decompression software
appropriate for display monitor 211, the processor of
computer 210 can cause a tokenized representation to be
decompressed into a form that display monitor 211 can display.
Thus computer 210 and display monitor 211 together serve
as a rendering engine for visual display. Similarly, computer
210 and local printer 212 can render the tokenized
representation of the document as hardcopy output. Local printer
212 can be a "dumb" printer, with little or no on-board
computing hardware, since computer 210 does the work of
decompression.

Further, computer 210 can render the document image(s)
in forms not immediately readable by a human being, but
useful nonetheless. Computer 210 can run decompression
software that outputs image data in unstructured (e.g.,
CCITT Group-4) compressed format, which can be
transmitted across telephone lines by modem 213. Computer 210 can
also output uncompressed or compressed image data to
persistent storage 214 for later retrieval, and can output
uncompressed or compressed image data to network output
device 215 for transmission elsewhere (e.g., to another
computer in LAN 140 or the Internet 180). If the
decompressed document includes hypertext links or other
annotations, as described below, computer 210 can interpret
a user's indicated selections of such annotations and can
transmit these selections across the network along with the
image data.

Network printer 220 is a printer that has its own on-board
computing hardware including a CPU and memory.
Therefore, unlike local printer 212, network printer 220 can
perform its own decompression without the assistance of a host
computer or server. Network printer 220 is thus a
full-fledged rendering engine, capable of turning tokenized input
files into hardcopy output. In this respect, it is like printer 76
that was shown in FIG. 7.

Continuing in FIG. 10, print server 230 is a computer that
can control "dumb" printers and that can be used for
temporary storage of files to be printed by such printers.
Whereas general-purpose computer 210 is assumed to be a
computer that is used interactively by a human user, print
server 230 is a computer used primarily for controlling
printers and print jobs. Its processor executes decompression
software to produce images that can be sent to IOT 231 for
immediate printout, sent to a prepress viewer 232 for
preliminary inspection prior to printing, or spooled (temporarily
stored) in persistent storage of print server 230 for later
printing or prepress viewing.

Multifunction devices are a class of standalone devices
that offer a combination of printing, copying, scanning, and
facsimile functions. Multifunction device 240 is assumed to be a
"smart" device, having its own processor and memory, with
sufficient computing power to decompress its own tokenized
files without assistance from a host computer or server.
Here, it is shown providing output to the network via
network output device 242; if a multifunction device 240 has
software to support a paper user interface, the output data
can include hypertext link selections or other information in
addition to the image data. Multifunction device 240 is also
shown providing compressed image data to a facsimile
machine 241. For example, multifunction device 240 can
contact facsimile machine 241 by ordinary telephone, and
send it compressed image data in CCITT Group-4 format.
Facsimile machine 241 receives the fax transmission from
multifunction device 240 as it would any other fax
transmission, and prints out a copy of the document.

Persons of skill in the art will appreciate that the systems
of FIGS. 9-10 are intended to be illustrative, not restrictive,
and that a wide variety of computational, communications,
and information and document processing devices can be
used in place of or in addition to what is shown in FIGS.
9-10. For example, connections through the Internet 180
generally involve packet switching by intermediate router
computers (not shown), and computer 210 is likely to access
any number of Web servers, including but by no means
limited to computer 100 and Web server 181, during a
typical Web client session.

Tokenized Representations

In a preferred embodiment, the tokenized document
representation produced by the tokenizing compiler is
organized in the DigiPaper format that will be described below
with reference to FIGS. 16-23. To ease the understanding of
the details of the DigiPaper format, some simplified
tokenized formats will first be considered with reference to FIGS.
11-13. These simplified formats are presented for purposes
of illustrating certain ideas that are basic to the tokenized
representations used in the invention, including but not
limited to DigiPaper.
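
The two ideas on which these simplified formats rely, a dictionary of token shapes and a list of positions that refer to it, can be sketched in Python. This is an illustration under stated assumptions, not the DigiPaper algorithm: the shapes are assumed to be already segmented from the page image and supplied in reading order, (X, Y) is assumed to be the top-left corner of each shape, shapes are matched by exact bitmap equality, and the names tokenize and reconstruct and the toy 3x3 bitmaps are hypothetical.

```python
from typing import Dict, List, Tuple

Bitmap = Tuple[str, ...]             # rows of '#' (black) and '.' (white)
Shape = Tuple[Bitmap, int, int]      # (bitmap, x, y) of one segmented shape
Position = Tuple[int, int, int]      # (token index, x, y)


def tokenize(shapes: List[Shape]) -> Tuple[List[Bitmap], List[Position]]:
    """Build the token dictionary and the position list from placed shapes."""
    tokens: List[Bitmap] = []        # the dictionary, in first-seen order
    index_of: Dict[Bitmap, int] = {} # exact-match lookup of known shapes
    positions: List[Position] = []
    for bitmap, x, y in shapes:
        if bitmap not in index_of:   # new shape: add it to the dictionary
            index_of[bitmap] = len(tokens)
            tokens.append(bitmap)
        positions.append((index_of[bitmap], x, y))  # always record the position
    return tokens, positions


def reconstruct(tokens: List[Bitmap], positions: List[Position],
                width: int, height: int) -> List[List[int]]:
    """Copy each token's bitmap onto a blank page at its recorded position."""
    page = [[0] * width for _ in range(height)]
    for index, x, y in positions:
        for dy, row in enumerate(tokens[index]):
            for dx, pixel in enumerate(row):
                if pixel == "#":
                    page[y + dy][x + dx] = 1
    return page


# Toy 3x3 stand-ins for the shapes "t", "h", "i", "s" (hypothetical bitmaps).
T = ("###", ".#.", ".#.")
H = ("#.#", "###", "#.#")
I = (".#.", ".#.", ".#.")
S = (".##", ".#.", "##.")

# The shapes of "this is", left to right on a single text line.
shapes = [(T, 0, 0), (H, 4, 0), (I, 8, 0), (S, 12, 0), (I, 20, 0), (S, 24, 0)]
tokens, positions = tokenize(shapes)
page = reconstruct(tokens, positions, width=28, height=3)
```

Six placed shapes yield a four-entry dictionary here; each repeated shape costs only one more position entry, which is where the compression comes from.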
FIG. 11 illustrates the concepts of tokens and positions
through a highly simplified example. A one-page input
document, whose image 1100 is shown, includes text 1101.
The document can be transformed into a tokenized
representation 1110. Tokenized representation 1110 includes a set
(or dictionary) of tokens 1111 and a set of positions 1112.

Each of the tokens 1111 represents a shape that occurs
somewhere in the document. Each token's shape is stored as
a bitmap. Each of the positions 1112 represents where one of
the tokens is to be placed, that is, where the token's shape
occurs in the document. For example, the shape "t," which
is associated with the first token, appears at a position whose
(X, Y) coordinates are given by the ordered pair (10, 20).
The shape "h," which is associated with the second token,
appears at a position whose (X, Y) coordinates are given by
the ordered pair (20, 30). In general, each of the positions
1112 includes a token index, that is, an index indicating a
particular one of the tokens 1111, together with an (X,Y)
coordinate pair that tells where the indicated token's shape
occurs in the document.

To generate the tokenized representation 1110 from the
document image 1100, a computer can detect the different
shapes that appear in the document image and note where
they appear. For example, scanning from left to right
beginning with the first line of text 1101, the computer first finds
the shape "t", then the shape "h", then the shape "i", then the
shape "s." The computer records each of these shapes as
tokens 1111, and records their respective positions as
positions 1112. Continuing rightward, the computer next finds
another "i"; since this shape is already in the dictionary, the
computer need only record its position. The computer
continues its procedure until the entire document image has
been scanned. In short, the computer can tokenize the image
by finding each shape in turn, determining whether that
shape is already in the token dictionary, adding it to the
dictionary if not and, in any case, storing its position in the
set of positions.

To reconstruct the image 1100 from the tokenized
representation 1110, a computer can read sequentially through the
positions 1112 and, for each position, transfer the shape of
the token whose index is listed to the listed (X,Y)
coordinate. Thus, in reconstructing the image 1100, a computer
will reuse the first token (the shape "t") twice, the second
token (shape "h") twice, the third token (shape "i") four
times, etc. Generally, the more often a token's shape appears
in a document, the greater the compression ratio obtainable
through the tokenized representation.

Note that the set of tokens 1111 is not a font. A tokenized
representation of a document according to the invention
includes no notions of semantic labeling or of character sets,
no encoding or mapping of sets of character codes to sets of
character names. The shapes "t", "h", "i" and so forth are
treated as just shapes, that is, particular bitmaps, and not as
letters of an alphabet or members of a larger set of character
codes. The shapes appear in the dictionary in the order in
which they first appear in document image 1101, not in any
fixed order. The shapes that appear in the document dictate
what will be in the dictionary, and not the other way around.
Any shapes that occur repeatedly in the document can be
used as token shapes, including shapes that have no
symbolic meaning at all. The shapes that make up text 1101 in
document image 1100 happen to be recognizable to
English-speaking humans as alphabetic characters, but they could
just as well be cuneiform characters or meaningless
squiggles, and the tokenizer would process them in the same
way. Conversely, a given letter of the alphabet that is to be
rendered as two distinct shapes (e.g., at two different sizes
or in two different typefaces) will be assigned two different
tokens, one for each distinct shape in which that letter
appears.

For a one-page document image such as image 1100, it is
not necessary to encode page information in the tokenized
representation. For multi-page images of longer documents,
the tokenized representation should include information
about which token shapes appear on which pages. To this
end, a separate set of positions can be maintained for each
page of the document. Typically with tokenized
representations, higher compression ratios are obtained for
multi-page documents, because the longer the document, the
more often each token can be reused.

FIGS. 12 and 13 illustrate, again in simplified fashion,
some different possibilities for multi-page tokenization
formats. FIG. 12 shows a tokenized representation (also called
an encapsulation) 1200 of a document whose rendered
image is n pages long. Tokenized representation 1200 begins
with file header 1205 and dictionary block 1206, which
contains the tokens and their shapes. Thereafter come
sequences of blocks for the pages of the multi-page
document image. Blocks 1211, 1212, and 1215 pertain to page 1;
1221, 1222, and 1225 pertain to page 2; and so forth
throughout the remaining pages (as represented by ellipsis
1250), including blocks 1291, 1292, and 1295, which pertain
to page n.

For each page of representation 1200, there is a page
header block, a position block, and a residual block. For
example, block 1211 is the header block for page 1; block
1212 is the position block for page 1; and block 1215 is the
residual block for page 1. The page header block indicates
the beginning of a new page, and can contain additional
page-specific information. The position block records which
tokens are to be placed at which positions of the current
page. The residual block stores the shapes, if any, that appear
on this page and that are not in the token dictionary, such as
shapes that appear only once in the document.

FIG. 13 shows a tokenized representation 1300 of a
multi-page document. Only the first two pages are shown,
the remainder of the document being indicated by ellipsis
1350. The format is similar to that of tokenized
representation 1200 in FIG. 12, except that there can be dictionary
blocks interleaved throughout the file. Tokenized
representation 1300 begins with file header 1305, followed by a
dictionary block 1310, page header 1311, position block
1312, and residual block 1315 for page 1. Dictionary block
1310 includes all the shapes that appear on page 1 of the
document image. Thereafter, tokenized representation 1300
continues at page 2 with an additional dictionary block 1320,
followed by page header 1321, position block 1322, and
residual block 1325 for page 2. Dictionary block 1320
includes all the shapes that first appear on page 2 of the
document image, that is, the shapes that were not needed in
order to render page 1 but that are needed to render page 2.
Accordingly, these new shapes are added to the dictionary
that is used to render page 2. The format continues in this
fashion (ellipsis 1350) until all pages are accounted for.
Additional dictionary blocks can be included in the format
whenever a new set of repeating shapes is needed to render
subsequent pages of the document image.

Tokenized Representation Extensions

The format of a tokenized representation can be extended
to accommodate information not readily subject to
tokenization. For example, if a source structured representation of
a document contains black-and-white text together with a
color photograph, the image of the color photo can be
compressed using JPEG or other compression techniques
and the black-and-white text image can be compressed using
DigiPaper or other tokenizing compression according to the
invention. The JPEG compressed photo, or a pointer to it,
can be stored in an extension section of the position block
for that page, if the tokenized format supports such
extensions. In particular, position block extensions can carry
position-dependent information, and dictionary block
extensions can carry information that is to be reused in more than
one place.

Extensions can be used, for example, to support tokenized
compression of hypertext documents, such as World Wide
Web pages. As is well known, a Web page can contain
hypertext links to other Web pages. If an HTML document
intended as a Web page is compressed into a tokenized
representation according to the invention, its displayable
text and bitmapped graphics can be tokenized and its link
information (i.e., universal resource locator, or URL,
information) stored in extensions. If the same link is used
more than once in the document, its URL can be stored in a
dictionary extension, and the page positions which are
considered active and which designate that link can be
stored in position extensions. If the link occurs only once,
both the URL and the page position can be stored as a
position extension.

Extensions can also be used to support tokenized
compression of objects containing embedded objects, such as
Microsoft OLE objects (Microsoft Corp., Redmond, Wash.).
An embedded object, such as an active spreadsheet
embedded in an otherwise-textual document created with a word
processing application program, can be represented by
incorporating appropriate information (e.g., a pointer to the
object) in the position block extension of the page of the
rendered document on which that object is to appear. If the
object is embedded at multiple points in the document, its
corresponding information can be put into a dictionary
extension.

Compression and Decompression Method Steps

The flowcharts of FIGS. 14 and 15 illustrate, respectively,

image generated directly from a PDL or other structured
document description. Such images are inherently free from
noise, losses, distortions, scanning artifacts, and the like.
Thus, there is no need to use approximate or heuristic
classifiers as is done in known methods for tokenizing
scanned documents. Instead, exact classification can be
used, and time-consuming and error-prone heuristic
comparisons can be eliminated. In particular, an exact classifier
does not mistakenly confuse two characters, such as the
number "1" and the letter "l", whose shapes closely
resemble one another.

The PDL decomposer used in step B can be, for example,
decomposer 45 from FIG. 5. The compressor used in steps
C through E can be, for example, compressor 47 from FIG.
5. (A direct compiler, per arrow 49 of FIG. 5, goes directly
from step A to step E.)

FIG. 15 shows the steps for rendering a tokenized
representation into an output image. A tokenized representation,
such as a DigiPaper file, is read into working memory (step
G). Thereafter, a loop begins (step H) as the decompressor
reads through the blocks of the file. If the next block is a
dictionary block (step I), the dictionary block is read (step J)
and its tokens added to any tokens already in the dictionary
stored in working memory (step K). Alternatively, if the next
block is a page header (step L), that page is decompressed
and rendered (steps M through Q): The position block for the
page is read (step M); it will be interpreted with respect to
the set of tokens of the dictionary currently stored in
working memory. The residual block is also read (step N).
The tokenized symbols are then converted into a bitmap
image of the page (step O), using the information from the
position block for the page and the tokens in the currently
stored dictionary. The individual bitmaps for the tokens are
transferred (for example, using a bit-blt operation) into the
larger bitmap that is being constructed for the page. Also,
any extensions are processed at this time. Next, residuals are
rendered, their bitmaps being transferred into the larger

how the compression and decompression software works in bitmap as well (step P). The completed page image is output 

the specific embodiment. 40 (step Q) to a display screen, IOT, persistent storage, 

FIG. 14 shows a sequence of steps for compiling a network, fax, or other output mechanism. The loop contin- 

structured document representation into a tokenized repre- ues (step H) until the entire tokenized representation (or any 

sentation. A structured document representation, such as a desired portion thereof) has been processed (step R). 

PDL file, is read into working memory (step A) and is Details of the DigiPaper Tokenized Representation 

rendered into a set of bitmap images, one per page (step B) 45 ^ next gecti numbeTcd x throu ^ 8 for 

by a conventional PDL decomposer Thereafter, tokening conveniencej t in detail a format for tokeniz * d 

compression is performed (steps C, D, and E) by the of documems mat ^ ^ m a ferred 

compressor First, *e bitaap images are analyzed to identify ment of ^ mvention ^ for ^ refefence 

so that muluple occurrences of the same shape ^ can be 50 (fleedIess > fa be ^ ^ 

assigned to the same token (step D). Thereafter, the token £ Qtatkms previously with £ t 

dictionary, position information, and residuals are encoded tQ pj^g n_i3 

(step E), together with any extensions, such as hypertext * 

links or embedded nonbinary image components. This com- Sectl0n 1 dlscusses desi S n catena that influenced the 

pletes the construction of the tokenized compressed 55 design of the DigiPaper format. Section 2 gives an overview 

representation, which is then output (step F). of ^ components of a compressed data stream in this 

The step of identifying shapes (step C) is performed in the format ' without makin S reference to the higher-level 

specific embodiment using a connected components structures of the data stream. Sections 3 through 5 give more 

analysis, although any other suitable technique can be used. detailed descriptions of each of those components. Section 

The step of classifying shapes (step D) is performed in the 60 6 descnbes the algorithm used to build a Huffman tree, 

specific embodiment using a very simple, lossless classifier: SecUoa 7 ^ ves a description or a mgher-levei data stream 

Two shapes are considered to match one another if and only that enca P s ulates the components. Section 8 discusses some 

if they are bitwise identical. This simple classifier contrasts additional aspects of this data stream format, 

favorably with the cumbersome classifiers used in the Th e text of Sections 1 through 8 includes references to 

tokenization of scanned documents in the prior art, and 65 Tables 1 through 12. 

points to an advantage of the invention: According to the These tables can be found at the end of the Detailed 

invention, the document image that is being tokenized is an Description. 
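The exact-match classification described for step D of FIG. 14 can be illustrated with a short sketch. This is not the patented implementation; it assumes shapes are already extracted as small 0/1 row lists, and the function name `tokenize_exact` is hypothetical.

```python
def tokenize_exact(shapes):
    """Give bitwise-identical shapes the same token number (illustrative only)."""
    dictionary = {}    # bitmap (as nested tuples) -> token number
    assignments = []   # token number chosen for each input shape, in order
    for shape in shapes:
        key = tuple(tuple(row) for row in shape)
        if key not in dictionary:
            # First time this exact bitmap is seen: it becomes a new token.
            dictionary[key] = len(dictionary)
        assignments.append(dictionary[key])
    return dictionary, assignments
```

Because matching is by exact equality, two shapes that differ in a single pixel, or in their dimensions, always receive different tokens.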
1. Introduction

Criteria that influenced the design of this coding format include:

It should be possible to encode multiple pages in a single
stream, as the compression achieved for multiple-page
documents is considerably better than the compression
achieved for single-page documents.

If a document, encoded in this format, is stored in a file,
then it should be possible to recreate any given page
without having to parse fully all the preceding pages.

The coding of individual values within the format should
be as simple as possible, consistent with the goal of
good compression; this allows implementation in
low-cost devices.

2. Data Stream Components

A data stream encodes a document, which consists of a number of
pages. The data stream comprises some number of dictionary blocks,
position blocks, and residual blocks.

All bytes are filled from MSB to LSB. Unless specified otherwise,
[...]

2.1. Dictionary Blocks

A dictionary block contains information about a number of tokens.
Each token's bitmap (and associated size and width) are stored in
the dictionary block. Some other information about each token is
also stored in the dictionary block. Specifically, the number of
uses of each token (its use count) is encoded along with the token.
This allows the decoder to build a Huffman tree giving the encoding
of each token number.

Dictionary blocks can be arbitrarily interleaved between pages,
except that there must be at least one dictionary block before the
first position block.

2.2. Position Blocks

A position block contains a number of triples, each comprising an X
coordinate, a Y coordinate, and a token number. The tokens
referenced in any given position block must be defined in some
dictionary block that precedes (in the data stream) the position
block.

Each position block is interpreted relative to the union of all
previous dictionary blocks: it can contain any token from any of
those blocks (but see Subsection 3.3). The decoder therefore must
consider all the tokens in all those dictionary blocks, and build a
Huffman tree based on the use counts associated with each token in
order to decode the token numbers encoded in the position block.
Details on building this Huffman tree are given in Section 6.

There can be at most one position block per page.

2.3. Residual Blocks

A residual block encodes a bitmap that contains all the non-token
portions of a page. It can be decoded without reference to any
block of any type.

There can be at most one residual block per page.

3. Dictionary Block Encoding

A dictionary block contains a set of tokens to be used (together
with the tokens from previous dictionary blocks) to decode
subsequent position blocks.

The format of a dictionary block is shown in FIG. 16. Dictionary
block 1600 contains a first value 1610, to be described shortly,
that is either a token count or a dictionary clearing code. This is
followed by a flag 1620 indicating which use count encoding table is
to be used for this dictionary block. Additionally, dictionary block
1600 contains height classes (see Section 3.1) such as, for example,
height class 1 1630, height class 2 1640, and further height classes
(as indicated by ellipsis 1650). Following the height classes are an
END code 1660 and dictionary extension section 1670.

The first value 1610 in a dictionary block is a 32 bit (unsigned)
value indicating the number of tokens stored in that block. This
value, the token count, is itself stored using the encoding from
Table 1. If the number of tokens is specified as being zero, then
the first value 1610 is a dictionary clearing code (as a dictionary
block containing zero new tokens is not useful); see Subsection 3.3
for details on dictionary clearing codes.

Following the token count 1610 is a 1-bit flag 1620 indicating which
use count encoding table is used for this dictionary block: If the
bit is 0, Table 3 is used to encode token use counts; if the bit is
1, Table 4 is used.

3.1. Height Classes

All the tokens stored in the dictionary block are sorted by their
heights and widths, and grouped into height classes: groups of
tokens having the same height. All tokens of a certain height are in
the same height class. Within the height class, they are sorted by
increasing width.

The format of a height class is shown in FIG. 17. Height class 1700
contains a first code 1710, a first token's width 1720, a use count
1730 of token 1, a delta width 1740 of token 2, a use count 1750 of
token 2, additional delta widths and use counts for additional
tokens (as indicated by ellipsis 1760), an END code 1770, the size
1780 (in bytes) of the compressed token image, and the compressed
token images [...].

3.1.1. Encoding of Token Heights

The first code 1710 in the height class is the difference in height
from the previous height class. Classes appear from the smallest
(shortest) on up, so these deltas are always positive. The deltas
are encoded according to Table 2, except that, since each height
class differs by at least one from the previous class's height, the
delta is decremented by one before being encoded. There is an
imaginary height class of height zero preceding the first real
height class, so the first class's height is encoded directly. The
last height class is followed by an END code from Table 2 instead of
a valid height code.

3.1.2. Encoding of Token Widths

Within each height class, the tokens are sorted by increasing width.
The width of each token is represented as a difference from the
previous token's width: this is always non-negative. The first
token's width 1720 is encoded directly (i.e., as a delta from an
imaginary token of width zero). The widths are encoded using Table
2. Note that the encoding for a width delta w is exactly the same as
the encoding of w+1 as a height delta. The last token in each height
class is followed by an END code from Table 2.

3.1.3. Encoding of Use Counts

Each token has an associated use count. This is, in concept, the
number of times that this token occurs in all the position blocks
between this dictionary block and the next dictionary block. In some
cases, it may not be exactly this value (i.e., the decoder should
not count on the token occurring exactly that many times in those
position blocks). These use counts should only be used to build the
Huffman coding of token numbers (see Section 4).

Some tokens are single-use tokens. This means that the compressor
guarantees that this token is used exactly once, and so the
decompressor may be able to free up memory once it has used the
token. Typically, such tokens are large, so the memory savings that
this can afford the decompressor
is significant. For single-use tokens, the use count is really
one, but is encoded as zero to distinguish it from other 
tokens which happen to be used only once between this 
dictionary block and the next (singletons), but which theo- 
retically could be re-used later. Single-use tokens should not 
be completely forgotten once they are used (they must be 
considered when building Huffman trees, even if they can no 
longer occur), but the only information that needs to be 
retained is the size of the token and its position within its 
dictionary block (needed to break ties when computing the 
token's Huffman code); its image information can be dis- 
carded. 

This might seem like a waste — once the single-use token 
has occurred in some position block, then it cannot reoccur, 
and so its portion of the token number code space is wasted. 
However, suppose that the decompressor skips the position 
block where the token's use occurs. This might happen, for 
example, because someone was interactively browsing a file 
stored in this format, and they skipped over the page where 
the single-use token was used. The decompressor would 
then have no way of knowing, short of completely parsing 
that skipped page's position block, that the single-use token 
had been used; this extra parsing (possibly of many skipped 
pages) is detrimental to interactive use; it introduces an 
unneeded dependence between the parts of the file. 

In some applications, singletons and single-use tokens 
might not be stored in the token dictionary; they might be 
encoded in the residual block of the page where they occur 
(this generally yields better compression and reduced 
decoder memory requirements). If they are present in this 
dictionary block, Table 3 should be used to encode use 
counts; if they are not present, Table 4 should be used. The 
use count encoding flag bit (in the dictionary block header) 
indicates which table was used. Note that Table 4 cannot 
encode use counts of 0 or 1. 
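The table choice described above can be sketched as follows. This is only an illustration under stated assumptions: tokens are given as hypothetical (use count, single-use) pairs, the function name is invented, and the actual bit patterns of Tables 3 and 4 are not reproduced.

```python
def encode_use_counts(tokens):
    """tokens: list of (use_count, is_single_use) pairs.

    Returns (table_flag, stored_counts). A flag of 0 selects Table 3 and a
    flag of 1 selects Table 4, following the dictionary block header.
    """
    # Single-use tokens are stored as 0 so they can be told apart from singletons.
    stored = [0 if single else count for count, single in tokens]
    # Table 4 cannot represent 0 or 1, so any such count forces Table 3.
    flag = 0 if any(count <= 1 for count in stored) else 1
    return flag, stored
```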
3.1.4. Encoding of Token Images 

All the token images within a height class are concat- 
enated left-to-right in the same order (i.e., sorted by increas- 
ing width), with the first (smallest) being placed leftmost. 
This single image is then CCITT Group-4 compressed. The 
Group-4 compression uses no EOL codes, and fills bytes 
MSB-to-LSB. 

The length (in whole bytes) of the encoding is written out 
as a 32 bit value using Table 1. The compressed image is 
then written out, beginning at the next byte boundary in the 
file. The next height class begins on the byte boundary 
following the compressed image; thus, the Group-4 com- 
pressed image of the height class begins and ends on a byte 
boundary. 

In some cases, Group-4 compressing the image of the 
height class increases its size. When this happens, the 
encoder may store the image bitmap uncompressed. It 
indicates this by saying that the length of the stored bitmap 
is zero bytes. This is an impossible byte count for the results 
of compression, as no height class is empty, so the decoder 
can recognize this situation. The size of the height class 
bitmap is known to the decoder at this point, so it knows the 
number of bytes it actually occupies. Each row of the bitmap 
is padded to end on a byte boundary. 
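As a small worked example of how a decoder could size an uncompressed height class bitmap, the sketch below assumes that the concatenated image is as wide as the sum of the token widths and that each row is padded to a whole byte, as stated above. The function name is hypothetical.

```python
def uncompressed_height_class_bytes(height, token_widths):
    """Bytes occupied by a height class stored without Group-4 compression."""
    total_width = sum(token_widths)       # tokens are concatenated left-to-right
    row_bytes = (total_width + 7) // 8    # each row padded to a byte boundary
    return height * row_bytes
```

For a height class of height 10 holding tokens of widths 3, 5, and 9, the concatenated row is 17 pixels wide, so each row takes 3 bytes and the bitmap takes 30 bytes.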
3.2. Dictionary Block Extensions 

After the last height class, the dictionary block may 
contain extensions. At the moment, this section of the 
dictionary block is largely undefined. It is expected that it 
will be used to store extra information about the tokens in 
the dictionary block; for example, what ASCII characters 
they represent, if this has been determined. 

The only part of the extension section that is defined in 
this embodiment is the length field. Immediately following 
the last height class is a 32 bit value (stored using the
encoding in Table 1) giving the size, in bytes, of the
dictionary block extension section. The extension section
itself, if any, begins on the next byte boundary. If there are
no extensions, a length of 0 should be given.
3.3. Dictionary Clearing Codes 

If the value of the number of tokens field in a dictionary
block is zero, then this indicates that this dictionary block is
preceded by a dictionary clearing code. Such clearing codes
reduce storage requirements in the decompressor, as well as
improving the storage efficiency by reducing the number of
tokens in the Huffman tree, and thus the number of bits
required to encode token numbers in subsequent position
blocks. They indicate that the token dictionary stored in the
decompressor should be cleared. However, some tokens
from previous dictionary blocks (the ones the compressor
thinks most likely to be useful in the future) may be retained.

The format of this clearing section is shown in FIG. 18.
Dictionary clearing section 1800 contains a value 1810
indicating the number of tokens to be retained, followed
by the Huffman codes for the retained tokens (e.g.,
code 1820 for the first retained token, code 1830 for the
second, etc., additional codes being represented here by
ellipsis 1840). Following the Huffman codes is a value 1850
indicating the number of new tokens in this dictionary block.
The clearing section occurs immediately after the "zero
tokens in this dictionary block" flag that indicates its
presence. The number of tokens to be retained 1810 is encoded
using Table 1. The final value in the section is the number
of new tokens in this dictionary block; the dictionary block
then proceeds as usual. Note that the Huffman tree must be
built, as it would have been for a position block at this
location in the file.

4. Position Blocks 

Position blocks encode binary images by storing a 
sequence of (token position, token number) pairs. A position 
block does not contain the size of the image rectangle that 
it represents; this is left to some other layer of the file format. 

The tokens used within any position block can be drawn
from any dictionary block which precedes it in the file
(unless some preceding dictionary block contained a
dictionary clearing code; see Subsection 3.3). The tokens are
referred to by their Huffman codes. These are computed by
(logically) concatenating all previous dictionary blocks, and
then building a Huffman tree of the use counts of the tokens
in those blocks. Note that this tree must be rebuilt every time
a new dictionary block is encountered in the file. The exact
algorithm for building the Huffman tree is given in Section 6.
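The tree construction and canonical code assignment specified in Section 6 can be sketched as follows. This is an illustration rather than the reference implementation: the function names are hypothetical, the two-lowest search uses a simple sort instead of the priority queue that Section 6 suggests, and code lengths are assumed not to exceed 32 bits.

```python
def huffman_code_lengths(use_counts):
    """Return the code length in bits for each token, in file order."""
    # A stored count of zero marks a single-use token and is treated as one.
    nodes = [(count if count > 0 else 1, [i]) for i, count in enumerate(use_counts)]
    lengths = [0] * len(nodes)
    while len(nodes) > 1:
        # Pick the two lowest counts; ties go to the element nearest the start.
        ranked = sorted(range(len(nodes)), key=lambda k: (nodes[k][0], k))
        first, second = sorted(ranked[:2])
        leaves = nodes[first][1] + nodes[second][1]
        for leaf in leaves:
            lengths[leaf] += 1  # each leaf under the merged node gains one bit
        # The merged node replaces the earlier element; the later one is removed.
        nodes[first] = (nodes[first][0] + nodes[second][0], leaves)
        del nodes[second]
    return lengths


def canonical_codes(lengths, max_len=32):
    """Assign (code value, code length) pairs using the f[l] recurrence."""
    c = [0] * (max_len + 1)              # c[l]: number of codes of length l
    for length in lengths:
        c[length] += 1
    f = [0] * (max_len + 1)              # f[max_len] = 0
    for l in range(max_len - 1, -1, -1):
        f[l] = (f[l + 1] + c[l + 1]) // 2
    next_code = list(f)                  # next unused code value per length
    codes = []
    for length in lengths:               # tokens are visited in file order
        codes.append((next_code[length], length))
        next_code[length] += 1
    return codes
```

For use counts 5, 1, 1, 2 the sketch yields code lengths 1, 3, 3, 2 and the codes 1, 000, 001, 01, which form a prefix-free set.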

For the purposes of this discussion, it is assumed that the
coordinates of the top left corner of the image rectangle
encoded by this position block are (0,0). Since all the
coordinates within the block are relative, the actual
coordinates can be anything; everything is encoded relative to this
top-left position. Coordinates increase down the image, and
rightwards across the image. Usually, the Y coordinate
represents the vertical position of an instance of a token, and
the X coordinate represents its horizontal position. However,
there is a transposed encoding mode, intended for documents
where the primary direction of text flow is vertical
(such as occurs in Chinese text). In this case, the X
coordinate of a token position represents its vertical position in
the image, and the Y coordinate represents its horizontal
position.

The position that is encoded for a token is the position of
its bottom left corner pixel in the normal encoding mode,
and the position of its top left corner pixel in transposed
encoding mode.

The format of a position block 1900 is shown in FIG. 19. 
The first value 1910 is the number of tokens present in this 
position block, encoded using Table 1. Following that is 
some information about the encoding used within this block. 
The fields here are: 
Modal Delta X Value 

This unsigned 4-bit field (field 1920) gives the modal 
delta X value. This value is subtracted off all delta X values 
before they are encoded, and must be added back upon 
decoding. 
Strip Height 

This 2-bit field (field 1930) gives the height of the strips 
that the image is divided into. Three values are currently 
defined: 0, 1, and 3, indicating strip heights of 1, 2, and 4 
pixels respectively. 
First X Encoding Table Flag 

This 2-bit field (field 1940) indicates which encoding 
table was used to encode the first X position within each 
strip; see Tables 5 and 6. Values of 2 and 3 are currently 
undefined. 

Delta X Encoding Table Flag 

This 2-bit field (field 1950) indicates which encoding 
table was used to encode the delta X values within each 
strip; see Tables 7, 8, and 9. A value of 3 is currently 
undefined. 

Delta Y Encoding Table Flag

This 2-bit field (field 1960) indicates which encoding 
table was used to encode the delta Y values between strips; 
see Tables 10, 11, and 12. A value of 3 is currently undefined. 
Transposition Flag 

This 1-bit field (field 1965) contains 0 if the position block 
is encoded normally, and 1 if it is encoded transposed. 
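The fixed-width header fields listed above can be read with a small MSB-first bit reader. The sketch below rests on several assumptions that the text does not state outright: that the fields are packed consecutively in the order listed, that multi-bit fields are stored most significant bit first, and that the reader is already positioned just after the token count (whose Table 1 encoding is not reproduced here). The class and function names are hypothetical.

```python
class BitReader:
    """Reads bits from a bytes object, filling each byte from MSB to LSB."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current bit offset from the start of data

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


# Strip height codes named in the text: 0, 1, and 3 mean 1, 2, and 4 pixels.
STRIP_HEIGHTS = {0: 1, 1: 2, 3: 4}


def read_position_header_fields(reader):
    """Read the six header fields that follow the token count (assumed order)."""
    modal_delta_x = reader.read(4)
    strip_code = reader.read(2)
    if strip_code not in STRIP_HEIGHTS:
        raise ValueError("undefined strip height code")
    return {
        "modal_delta_x": modal_delta_x,
        "strip_height": STRIP_HEIGHTS[strip_code],
        "first_x_table": reader.read(2),
        "delta_x_table": reader.read(2),
        "delta_y_table": reader.read(2),
        "transposed": reader.read(1) == 1,
    }
```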

Following this initial encoding information, the locations 
and identifications of the tokens appearing in this image are 
encoded. The image is divided up into strips of the size 
encoded by the strip size field (1, 2 or 4 pixels). In the 
normal coordinate encoding mode, the strips divide the 
image into horizontal slices; in the transposed encoding 
mode, the strips divide the image into vertical slices. For 
clarity, strips will be described in the context of the normal 
encoding mode (in terms of rows). 

In position block 1900, the strips include strip 1 1970, 
strip 2 1980, and additional strips (as indicated by ellipsis 
1985). Following the strips is a position extension section 
1990. 

The first row of the first strip in a position block is the top 
row of the image. The strips are encoded top-to-bottom. 
Only strips containing invocations of some token are actu- 
ally coded; each nonempty strip encodes the number of 
strips that were skipped between it and the previous non- 
empty strip. Within each strip, the tokens are sorted by 
increasing X position. 

The format of a single strip is shown in FIG. 20. Strip 
2000 contains the Y difference 2010 from the previous strip, 
the X position 2020 and Y position 2030 of the first token, 
the Huffman code 2040 of the first token, the delta X 
position 2050 to the second token, the Y position 2060 of the 
second token, the Huffman code 2070 of the second token, 
and additional delta X, Y, and Huffman code information for
additional tokens (as indicated by ellipsis 2080). At the end 
of strip 2000 is an END code 2090. 

The first value in a strip (e.g., first value 2010 in strip
2000) is the difference between this strip's starting Y
position and the previous strip's starting Y position. Since strips
are constrained to begin on rows divisible by the strip height,
the encoder divides the actual difference by the strip height,
then encodes it. The encoding is done using one of Tables 10
through 12; which table is used is indicated by the "Delta Y
encoding table flag" in the position block's header. There is
an imaginary nonempty strip just above the top of the image;
this is used to compute the offset for the first strip's Y
position.

The X position of the first token within each strip is
encoded using Tables 5 or 6; which table is used is indicated
by the "First X encoding table flag" in the position block's
header. The X position is encoded as an offset from the first
X position of the previous strip (or as an absolute value, in
the case of the first strip).

The Y position of each token within a strip is encoded
with 0, 1, or 2 bits, depending on the strip height (strip height
of 1, 2 or 4). The value is the number of rows that this
token's reference position (its lower left corner) is down
from the top of the strip.
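The strip arithmetic described above can be worked through in a few lines. This sketch assumes the normal (untransposed) mode; the function names are hypothetical, and the starting row of the imaginary strip above the image is left as a caller-supplied value rather than guessed.

```python
def locate_in_strips(row, strip_height):
    """Return (strip start row, offset within strip, bits used for the offset)."""
    bits = {1: 0, 2: 1, 4: 2}[strip_height]   # 0, 1, or 2 bits per the text
    strip_start = row - row % strip_height    # strips begin on multiples of the height
    return strip_start, row - strip_start, bits


def strip_delta_y(strip_start, prev_strip_start, strip_height):
    """Delta Y value before table encoding: the row difference in strip units."""
    return (strip_start - prev_strip_start) // strip_height
```

For example, with a strip height of 4, a token whose reference row is 10 lies in the strip starting at row 8, at offset 2, and that offset is stored in 2 bits.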

The X position of each token in the strip, except the first,
is encoded (in the standard encoding mode) by taking the
token's X position and subtracting the sum of the previous
token's X position and the previous token's width; this
computes the difference in X between this token's lower left
corner and the pixel to the right of the previous token's
lower right corner. In the transposed encoding mode, the X
position of each token in the strip is encoded by taking the
token's X position and subtracting the sum of the previous
token's X position and the previous token's height.
Thus, in the transposed encoding mode, what is encoded is
the vertical difference between this token's upper left corner
and the pixel below the previous token's lower left corner.

In either case, the modal delta X value given in the
position block's header is subtracted from this value before
it is encoded; this ensures that the most common value
encoded is always zero. The encoding table used for the
resulting signed value is given by the "Delta X encoding
table flag" value; it is one of Tables 7 through 9.
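The delta X computation can be written out as a short sketch. The function names are hypothetical; `prev_extent` stands for the previous token's width in the standard mode or its height in the transposed mode, and the table lookup (Tables 7 through 9) is omitted.

```python
def encoded_delta_x(x, prev_x, prev_extent, modal_delta_x):
    """Signed value handed to the delta X table for a token after the first."""
    gap = x - (prev_x + prev_extent)   # distance past the end of the previous token
    return gap - modal_delta_x         # the most common gap becomes zero


def decoded_x(value, prev_x, prev_extent, modal_delta_x):
    """Inverse of encoded_delta_x: recover the token's X position."""
    return value + modal_delta_x + prev_x + prev_extent
```

With a previous token at X position 100 that is 12 pixels wide and a modal delta X of 3, a token at X position 115 is encoded as 0.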

The last token in a strip is flagged by an END code (drawn
from the appropriate delta X encoding table) instead of a
delta X code. Since strips are never empty, there is no way
to encode an END code in any of the first X encoding tables.

Note that there is no end-of-image code; instead, the last
strip is flagged by a Y position which is outside the possible
range for this image rectangle. This position does not start a
real strip, so there are no token positions following it.
Instead, it is followed (see FIG. 19) by a position block
extension section 1990, similar to the dictionary block
extension 1670 (from FIG. 16). Currently, the only part of
section 1990 that is defined is the length field: a 32 bit value
(stored using the encoding in Table 1) giving the size, in
bytes, of the position block extension section, which begins
on the next byte boundary. A length of 0 is used to indicate
an empty extension section.

5. Residual Blocks

Each page's bitmap is encoded in two parts: the position
block, giving the tokens from the dictionary used on this
page, and the residual bitmap. The residual bitmap encodes
all the marks on the page that were not encoded in the
position block. On decoding, the tokens specified by the
page's position block should first be written into the
uncompressed bitmap; the residual block should then be combined
with that bitmap via an OR operation. The bitmap stored in
the residual block may be smaller than the original page
bitmap. If the residual bitmap is empty (all white), then the
residual bitmap fields (including the length field) all contain
zero, and there is no encoded residual bitmap.
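The OR combination of the residual with the page can be sketched as follows. The representation is an assumption made for illustration: bitmaps are lists of rows of 0/1 integers, `left` and `top` are the offsets of the residual within the page, and the function name is hypothetical.

```python
def combine_residual(page, residual, left, top):
    """OR a (possibly smaller) residual bitmap into the page bitmap in place."""
    for dy, row in enumerate(residual):
        for dx, bit in enumerate(row):
            # Token pixels already drawn on the page are preserved by the OR.
            page[top + dy][left + dx] |= bit
    return page
```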
FIG. 21 shows the format of a residual block 2100. All the fields,
except the actual encoded residual bitmap, are unsigned 16 or 32 bit
values. They are encoded as 2 or 4 bytes respectively, with the most
significant byte appearing first ("big-endian" encoding).

Left Edge of Residual Bitmap

This field (field 2110) gives the position of the left edge of the
residual bitmap relative to the original bitmap. It is a 2 byte
value.

Top Edge of Residual Bitmap

This is a 2 byte value (value 2120) giving the position of the
residual bitmap's top edge relative to the original bitmap.

Width of Residual Bitmap

This is a 2 byte value (value 2130) giving the width of the residual
bitmap.

Height of Residual Bitmap

This is a 2 byte value (value 2140) giving the height of the
residual bitmap.

Length of Encoded Residual Bitmap

This is a 4 byte value (value 2150) giving the length in bytes of
the encoded residual bitmap.

Encoded Residual Bitmap

This is a CCITT Group-4 encoded representation 2160 of the residual
bitmap. The Group-4 compression uses no EOL codes, and fills bytes
MSB-to-LSB. As in the case of dictionary height classes, in
Subsection 3.1.4, this bitmap may optionally be stored uncompressed;
this is flagged by a byte-count value of zero.

6. Huffman Encoding

The algorithm used to build the Huffman tree is:

Build an array of the token use counts. Tokens whose use
counts are given as zero are considered to have a use
count of one (these are single-use tokens). The order of
the array should be the exact order in which the tokens
occurred in the file up to this point. After a dictionary
clearing code, the order of any retained tokens is the
order in which they appeared in the list of retained
tokens.

Scan the current array for the two lowest-value elements.
In cases of ties, always choose the element closest to
the start of the array. This can be done using a priority
queue with a primary key of the use count, and a
secondary key of the position in the array.

Create a tree node representing the merger of these two
elements. Its use count is the sum of their use counts.

In the array, replace the first of these two elements (the
one closest to the start of the array) with this merged
node. Remove the second element from the array (but
don't forget it).

Continue until the array contains only a single node.

Use this tree to find the length of the Huffman code for
each token: traverse the tree down to each token; the
length of this path is the number of bits in the code for
that token.

Assign the codes themselves using the "canonical Huffman
code" assignment algorithm:

Let c[l] be the number of codes of length l bits.

Assume that the maximum possible code length is 32.

    f[32] = 0;
    for (l = 31; l >= 0; l--)
        f[l] = (f[l+1] + c[l+1]) / 2;

f[l] is now the first (lowest) value for all the codes
having length l bits. These should be assigned in
increasing order, in the order that the tokens occur in
the file: the first token whose code is of length l gets
assigned the code f[l], the next of length l gets the
code f[l]+1, etc.

7. Encapsulating the Blocks

The current encapsulation of these blocks is quite simple; other
more complex encapsulations are possible. The one described here is
minimal, but is quite easy to parse, and allows random access to
pages without undue difficulty. The fields in this encapsulation are
shown in FIG. 22.

Identifying Header

This is a 5-byte field (field 2210) containing the bytes
0x54 0x03 0x6f 0x8d 0x50

File Version

This is a 1-byte field (field 2220) containing the version of the
encapsulation used. Currently this value is 9.

Length of Encoded Dictionary Block

This is a 4-byte value (value 2230) giving the length in bytes of
the dictionary block. This value is stored in network byte order
(MSB first), as are all the other numerical values in the
higher-level encapsulation.

Dictionary Block

This is a dictionary block (dictionary block 2240), in the format
described in Section 3. Currently, there is only one dictionary
block, and dictionary clearing codes are not used; modifying the
encapsulation to support these is not difficult.

Number of Pages

This is a 2-byte value (value 2250) giving the number of pages in
this file.

Pages

Each of these (e.g., encoded page 1 2260; encoded page 2 2270; [...]
2280) is encoded as shown in FIG. 23. The fields of encoded page
2300 are:

Page File Name

This is a NUL-terminated string (string 2310) giving the name of the
file that this page originally came from, or other identifying
information.

Page Width

This is a 2-byte value (value 2320) giving the width of this page's
bitmap.

Page Height

This is a 2-byte value (value 2330) giving the height of this page's
bitmap.

Length of Encoded Position Block

This is a 4-byte value (value 2340) giving the length in bytes of
the page's position block.

Position Block

This is a position block 2350, in the format described in Section 4.

Residual Block

This is a residual block, in the format described in Section 5. It
is not necessary to encode the length of the residual block, as it
can easily be determined by scanning the first few bytes.

7.1. Embedding Within TIFF

TIFF is currently commonly used to store CCITT Group-4 compressed
bitmaps. This subsection briefly

describes how dictionary blocks, position blocks, and
residual blocks could be embedded within TIFF files, to
allow TIFF to represent token-compressed bitmaps.

Since the decompressor needs to have seen all the dictionary
blocks preceding a position block in order to get the
decompression right, these dictionary blocks should be as
easy to find as possible. Preferably, there is at most one
block per page, stored (as a tag) in the top-level directory for
that page. As the decompressor walks through the file to get
to a particular page, it therefore has to pass by all the
dictionary blocks it will need. It doesn't need to parse them
until it actually runs into a token-compressed binary image,
but just remember their positions (and order).

The position blocks, on the other hand, should re-use as
much as possible of the information available for binary
images. They should be stored as regular binary images, but
using a variant compression method (the TIFF spec allows
compressed images to be tagged by the compression method
used).

The residual blocks could also be stored as binary images,
in the same pages as the corresponding position blocks;
storing multiple images for the same page is allowed by the
TIFF spec (but it does not adequately specify how they
should be combined).

8. Further Discussion

Here are some additional issues related to the current
DigiPaper format and to possible variations of the format.

In the current format, a position block represents an entire
page. In some applications (notably a fax output
device), pages might be broken down into slices; this
means that the page can start being printed as quickly
as possible, once the first page slice is decoded. Each
page slice would comprise a position block and a
residual block.

The top-level format would have to change slightly to
accommodate this: dictionary blocks would occur within a
page (between page slices). This conflicts with the goal of
allowing easy access to a single page: the decoder must read
through the page and pick up those dictionary blocks in
order to be able to decode some subsequent page. However,
it still does not need to completely decode each page slice
position block.

Any given document can have a large number of
representations, depending on how the coder classifies
the tokens on each page, where it places dictionary
blocks and dictionary clearing codes, its choice of
encoding tables, how pages are broken down into page
strips, and so on. Memory requirements in the encoder
and decoder can restrict the representations that can be
successfully generated or decoded. When the encoder
and decoder are conversing directly (as in a transmission
to a fax output device), they can negotiate a
memory limit, and the encoder can ensure that the
decoder will not exceed this limit, by breaking each
page down into small enough strips (to reduce the page
image buffer memory requirements), and by inserting
dictionary clearing codes (to reduce the token dictionary
memory requirements). Such restrictions are
likely to degrade compression.

When the document is compressed into a file, such a
negotiation is not possible, and so decoders reading from
such stored files must be prepared to use a (potentially) large
amount of memory. However, in such a situation, the
decoder is likely to be running on some powerful
general-purpose computer, so this requirement is not too onerous.
For fax machines, on the other hand, cost requirements can
lead to situations where memory use is severely restricted;
fortunately, these are exactly the situations where negotiation
is possible.

The encoded token height classes and residual bitmaps are
compressed using CCITT Group-4 compression, or are
stored uncompressed in the cases where Group-4 actually
increases their size. This was chosen because
systems (both hardware and software) to perform
Group-4 compression and decompression are common
and quite simple. The bitmaps could be stored with
any suitable lossless binary compressor; JBIG would be
one choice.

Applications of the Invention

Next, further applications of the invention will be
discussed. High-speed printing was mentioned earlier as one
application. The exemplary rendering components 200 that
were illustrated in FIG. 10 suggest other applications,
including prepress viewing, desktop publishing, document
management systems, and distributed printing applications,
as well as fax communications. In general, the invention can
find application in any situation where quick, high-quality
document rendering is needed.

The invention is particularly appropriate for interactive
documents, such as World Wide Web documents. Because of
the expressiveness of the tokenized representation
(especially as compared with HTML), Web documents
encoded in DigiPaper format can be rendered with fidelity
comparable to print media. Moreover, rendering speeds of
under 1 second per page for text and graphics are achievable.
This means fewer unwanted delays for users downloading
documents from remote Web servers.

The flowchart of FIG. 24 illustrates a simple interaction
between a Web server and a client computer running a Web
client (browser) program, such as Netscape Navigator
(Netscape Communications, Inc., Mountain View, Calif.),
that supports the Java programming language (available
from Sun Microsystems, Inc.). The client computer receives
a command indicating that the client computer's user has
selected a hypertext link pointing to a new Web page (step
AA) encoded in DigiPaper format. The computer responds
by following the selected link (step BB), and beginning to
download the selected page. The first thing to be downloaded
is a Java-language program, or applet (step CC),
which the client computer automatically begins to execute.
By executing the Java applet, the client computer is caused
to download a data file containing a DigiPaper tokenized
representation of the displayable text and graphics that make
up the readable content of the Web page (step DD). The
applet also includes DigiPaper decompressor software, so
that once the tokenized representation has been downloaded,
the client computer can render it (step EE) and display the
resulting Web page (step FF). The DigiPaper representation
can include extensions to support the hypertext links embedded
in the downloaded Web page, and the applet can
recognize the user's selection of new links on the decompressed
page (continuing in step FF). Depending on what the
user decides to do next (step GG), the applet can either link
to a new page (step BB) in response to the user's selection
of a link on the downloaded DigiPaper page, or can return
control to the browser (step HH). If a new Web page is
selected, the applet remains in control; in particular, if the
newly selected page is a DigiPaper page, the applet need not
be downloaded again (step BB). If the user has, for example,
selected a browser function not immediately related to the
contents of the currently displayed page, the applet can
terminate or suspend, and control can return to the main
browser program (step HH).

This example shows that where a DigiPaper tokenized
document representation is bundled with a decompressor
applet, the resulting package is, in effect, a self-rendering file
format. So long as the browser supports the industry-standard
Java language, the browser need not be specifically
enabled for DigiPaper. The applet takes care of that.
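
The control flow of FIG. 24 can be sketched as a simple loop. The following Python sketch is illustrative only: the function name run_session and its four callables are hypothetical stand-ins, not part of any DigiPaper applet, and the sketch assumes that every link the user follows points to a DigiPaper page. It shows that the applet is downloaded once (step CC) and then reused for each further page.

```python
from typing import Callable, List, Optional


def run_session(
    first_link: str,
    fetch_applet: Callable[[str], Callable[[bytes], str]],
    fetch_tokens: Callable[[str], bytes],
    choose_next: Callable[[str], Optional[str]],
) -> List[str]:
    """Trace steps AA to HH for a chain of DigiPaper pages (hypothetical helper)."""
    log = ["AA: link selected: " + first_link]
    decompress: Optional[Callable[[bytes], str]] = None
    link: Optional[str] = first_link
    while link is not None:
        log.append("BB: following " + link)
        if decompress is None:
            # Step CC: the applet (with its decompressor) is fetched only once.
            decompress = fetch_applet(link)
            log.append("CC: applet downloaded")
        data = fetch_tokens(link)      # step DD: tokenized representation
        log.append("DD: data downloaded")
        page = decompress(data)        # step EE: render the page
        log.append("EE: page rendered")
        log.append("FF: displaying " + page)
        link = choose_next(page)       # step GG: None means leave DigiPaper
        log.append("GG: user decision made")
    log.append("HH: control returned to browser")
    return log
```

Passing the decompressor back from fetch_applet mirrors the statement that the applet itself carries the decompressor software, so the browser needs no DigiPaper-specific support.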
Variations and Alternative Embodiments 

Many alternative embodiments of the invention are pos- 
sible. Here are a few examples: 
The structured representation of the source document 
need not be a PDL representation. Other possibilities 
include document exchange formats (e.g., PDF, Com- 
mon Ground) and PCL5. In general, any non-image- 
based structured document representation can be used. 
Although the DigiPaper file format is the preferred format 
for the tokenized representation, other structured docu- 
ment representations can be used. One possibility is to 
use a highly reduced subset of a PDL. The subset need 
include only a few operators, just enough to denote 
what the bitmaps are for the various symbols and where 
the symbols are to be positioned within the rendered 
image, along with basic commands to cause the sym- 
bols to be drawn at the desired positions. For example, 
in PostScript, the subset can be the operators 
imagemask, moveto, rmoveto, definefont, and show; 
these operators are defined in the PostScript Manual at 
pages 435, 456, 483, 398, and 520, respectively. In
particular, the definefont operator can accept bit- 
mapped fonts, and thus can be used to define the token 
bitmaps. 

Although the image-based DigiPaper tokenized represen- 
tation is resolution-dependent, it is nevertheless pos- 
sible to convert it to print or display at a resolution 
other than the one at which it was tokenized. This can 
be done, for example, by downsampling. The resulting 
images can be of acceptable quality for many applica- 
tions. 

The residual image for a page can be considered as just 
another token, although it is stored outside the dictio- 
nary block for efficiency. Alternatively, the residual 
image can be stored in the dictionary block, as a token 
or set of tokens. 

The inventive compression technique can be incorporated 
in a document compression system that supports both 
lossy encoding of scanned pages, and lossless encoding 
of rendered pages. Specifically, the inventive technique 
is used to provide lossless symbol-based representation 
of rendered text/graphics. Symbol-based techniques of 
the prior art can be used to encode scanned document 
pages containing text and graphics; preferably, the 
same file format (e.g., DigiPaper) is used for both the 
lossy and the lossless technique, so that the same 
rendering engine can be used regardless of the source 
of the document image. Another technique, such as 
JPEG or other lossy encoding technique, can be used 
for color and gray bitmap images (e.g., photographs). 
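
The downsampling variation mentioned above (rendering a tokenized bitmap at a resolution lower than the one at which it was tokenized) can be illustrated with a minimal Python sketch. The function name downsample, the integer reduction factor, and the majority threshold are assumptions chosen for illustration; the description does not specify a particular downsampling method.

```python
from typing import List

Bitmap = List[List[int]]  # rows of pixels: 0 = white, 1 = black


def downsample(bitmap: Bitmap, factor: int, threshold: float = 0.5) -> Bitmap:
    """Reduce a binary bitmap by an integer factor.

    Each factor-by-factor block becomes one output pixel, which is black
    when at least `threshold` of the block's pixels are black.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    result: Bitmap = []
    for top in range(0, height, factor):
        out_row = []
        for left in range(0, width, factor):
            # Collect the block, clipping at the right and bottom edges.
            block = [
                bitmap[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ]
            out_row.append(1 if sum(block) >= threshold * len(block) else 0)
        result.append(out_row)
    return result
```

For example, a 600 dpi token bitmap reduced with factor 2 yields a 300 dpi bitmap. Thin strokes may vanish at large factors, which is one reason the result is of acceptable rather than equal quality.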
Conclusion 

A new, computationally efficient method of compiling a
page description language into a tokenized, fontless structured
representation and of quickly rendering this fontless
representation to produce a document image has been
described. A compressor or tokenizer takes a set of page
images, formed directly from a PDL file or other structured
representation of a document, and converts these page
images into a tokenized representation based on tokens and
positions. A decompressor reconstructs the page images
from the stored tokens and positions, building up an overall
bitmap image for each page from the component subimages
of tokens whose shapes occur on that page. The tokenized,
fontless structured representation employed by the inventive
method provides a degree of expressiveness equal or comparable
to what has previously been available only with
PDLs. Yet this representation is highly compact and can be
rendered very quickly and predictably, and can conveniently
be bundled with decompression software to provide
self-extracting, self-displaying documents. The inventive
method can be embodied in hardware configurations that
include both general-purpose computers and special-purpose
printing and imaging devices.
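
The decompressor's task of building a page bitmap from tokens and positions can be sketched in a few lines of Python. This is a simplified illustration, not the DigiPaper decoder: the name render_page, the in-memory dictionary of token bitmaps, and the convention that each position gives the upper-left corner of a token are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Bitmap = List[List[int]]  # rows of pixels: 0 = white, 1 = black


def render_page(
    width: int,
    height: int,
    tokens: Dict[int, Bitmap],
    positions: List[Tuple[int, int, int]],
    residual: Optional[Bitmap] = None,
) -> Bitmap:
    """Paint each (token_id, x, y) onto a blank page, then OR in the residual."""
    page = [[0] * width for _ in range(height)]
    for token_id, x, y in positions:
        for dy, row in enumerate(tokens[token_id]):
            for dx, pixel in enumerate(row):
                # Black token pixels are ORed in; pixels off the page are clipped.
                if pixel and 0 <= y + dy < height and 0 <= x + dx < width:
                    page[y + dy][x + dx] = 1
    if residual is not None:
        for y in range(min(height, len(residual))):
            for x in range(min(width, len(residual[y]))):
                page[y][x] |= residual[y][x]
    return page
```

Because each token bitmap is stored once and merely copied to every position where its shape occurs, rendering cost depends on the number of token placements rather than on font rasterization.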
Browse-Now-Print-Later 

As described previously with reference to FIG. 24, the
present invention is particularly appropriate for use with
interactive documents, such as World Wide Web documents.
In particular, Web documents encoded in DigiPaper format
can be rendered with high fidelity. Moreover, where a
DigiPaper tokenized document representation is bundled
with a decompressor applet, the resulting package is, in
effect, a self-rendering file format.

There are times when it is desirable to control user access
to particular Web documents. Consider, for example, a
copyrighted document stored on a Web server and accessible
from one or more Web pages. Suppose that a user browses
a Web page containing the copyrighted document, possibly
paying for the privilege, and proceeds to download the
document from the server. Thereafter, there is typically
nothing other than respect for the law to prevent the user
from making and distributing any number of digital copies
of the downloaded document, thereby potentially undermining
the value of the copyrighted work. Thus it would be
desirable if somehow the user could be given the ability to
browse the document on the Web, yet still be prevented from
obtaining a high-quality digital copy of the document.

At the same time, the user's need for a high-quality
printed copy of the document must also be met. For many
people, typical visual displays such as CRTs, backlit LCDs,
and the like are not comfortable for intensive or lengthy
reading tasks. In particular, the resolution of these displays
is too low. For this reason, Web users will often choose to
print out a Web document and read it on paper, rather than
attempt to read the document in its entirety right from the
display screen. Typically, the printed output is easier to read
than the displayed output and, in particular, provides higher
resolution. For example, whereas a screen display of 72 dpi
resolution may be uncomfortable for extensive reading
tasks, a laser-printed page at 600 dpi resolution or higher is
quite comfortable for most readers.

DigiPaper provides a solution to both of these problems.
It will be recalled from the description in earlier sections that
DigiPaper provides a resolution-dependent structured document
representation. The resolution-dependence of
DigiPaper, together with its favorable speed and compression
ratio, means that DigiPaper document representations
can be readily made available at different resolutions in
different media to different parties, with different levels of
trust and security. In particular, low-resolution Web browsing
and high-quality, high-resolution printing can be
decoupled from one another. A Web user can browse a
copyrighted document electronically at low resolution and
can upon request obtain a high-resolution printed copy, all
without being given access to a high-quality digital copy of
the work.

According to the invention in an embodiment that will
now be described, a resolution-independent structured
representation of a source document, such as a PDL
representation, is losslessly converted into two (or more)
different DigiPaper tokenized representations at different
respective resolutions. For example, a PDL original is
encoded as a low-resolution DigiPaper representation and
also as a high-resolution DigiPaper representation. The
low-resolution representation is suitable for on-line Web
browsing and screen display, but is of insufficient resolution
for high-quality printing. Only the high-resolution representation
is suitable for producing high-quality printed copies of
the document.

The low-resolution representation is posted on a Web
server and so becomes available for on-line browsing by any
and all Web users, including those who may be untrustworthy
or unaware of the copyright laws. These users can
browse the document on the Web, typically free of charge,
simply by pointing their Web browsers (clients) to the Web
site or sites where the low-resolution DigiPaper representation
resides. A user who is interested in the document and
wants to obtain a high-quality printed copy to read sends the
Web server a request for a printout. In response, the server
contacts a trusted printing facility (for example, a print
bureau or copy shop) and provides that facility via a secure
communications channel with the high-resolution DigiPaper
representation of the document. From the high-resolution
DigiPaper representation, the trusted printing facility prints
a hardcopy of the document, which is in turn made available
(for example, delivered or mailed) to the user who requested
it. The user is billed accordingly, and appropriate copyright
royalties flow to the copyright holder. Importantly, the user
never has access to the high-resolution DigiPaper representation
of the document, and so is effectively precluded from
making high-quality digital copies of the document. That is,
the only digital copy of the document to which the user is
granted access is a low-resolution copy.

This style of Web use according to the invention can be
called browse-now-print-later (or, somewhat more precisely
but harder to say, browse-insecurely-print-securely). It is
illustrated in the conceptual example of FIG. 25. A Web user
runs a Web browser (client) on a PC or other local computer.
The user sees displayed, for example, a window 2510
through which the user interacts with the browser. A
low-resolution representation 2520 of a document can be seen in
window 2510. It includes low-resolution text representation
2521 and low-resolution line art representation 2522. The
user is free to download low-resolution representation 2520,
and even to reproduce it (for example, to let friends or
customers know about the document). Low-resolution
representation 2520 is good enough to show the user whether
the document is one of interest, but is not good enough for
the user comfortably to read low-resolution text 2521 or
appreciate the details of line art 2522.

Also present in window 2510 is a hypertext link 2523
which the user can select, for example with a click of the
mouse or other pointing device, to order a high-resolution
hardcopy of the document. Upon issuance of the request,
combined with appropriate payment or credit as indicated by
arrow 2525, the user's order is transmitted to a print shop or
other trusted printing facility, along with a high-resolution
representation of the document. A high-resolution printed
copy 2530 is made from the high-resolution representation.
Thereafter, the print shop can mail or deliver high-resolution
printed copy 2530 to the user, or the user can visit the print
shop and pick up copy 2530 there. The user can comfortably
read high-resolution copy 2530, which clearly shows the
text and line art of the document (respectively, as
high-resolution text representation 2531 and high-resolution line
art representation 2532). Meanwhile, the print shop or other
trusted printing facility collects and processes the fee and
forwards any applicable copyright royalties to the copyright
holder or holders. The user's payment can be made, for
example, by electronic debit or credit, or over the Internet if
a secure pay-by-Internet scheme is available; alternatively,
an invoice can be sent to the user.

FIG. 26 illustrates the document encoding stage in a
browse-now-print-later embodiment of the invention. A
resolution-independent source document representation
2601, such as a PDL file, is provided via one or more secure
communications channels 2609 to encoders 2610, 2630 to
be converted to DigiPaper tokenized format at low and high
resolutions respectively. Encoders 2610, 2630 can be, for
example, two distinct computers running different DigiPaper
encoding programs or a single computer running a program
that accepts an input parameter to control encoding resolution.
The low resolution (for example, 72 dpi) is acceptable
for generating screen displays but unacceptable for
high-quality printing. The high resolution (for example, 600 dpi)
is appropriate for high-quality printing. Secure communications
channels 2609 can be, for example, dedicated hardwired
or telephonic links or suitably encrypted network
pathways. In any event, information sent across secure
channels 2609 is not readily subject to unauthorized
interception or copying.

The low-resolution DigiPaper representation formed by
low-resolution encoder 2610 is provided, in this
embodiment, via an insecure communications channel 2611
to a display server or service 2620, while the high-resolution
representation formed by high-resolution encoder 2630 is
provided via a secure channel 2631 to a print server or
service 2640. Insecure channel 2611 can be, for example, an
unencrypted Internet pathway. The low-resolution representation
sent across insecure channel 2611 is subject to unauthorized
interception and copying. Secure channel 2631, like
channels 2609, can be, for example, a dedicated hardwired or
telephonic link or a suitably encrypted network pathway, or
any communications channel not readily subject to unauthorized
interception. Display server or service 2620 and
print server or service 2640 can be, for example, two
physically separate computers (that is, servers) or can be two
processes (that is, services) executing on a single computer.
Note that print server or service 2640 is entrusted with the
safekeeping of the high-resolution DigiPaper representation,
and so must itself be trustworthy.

FIG. 27 illustrates a simple example of the decoding in a
browse-now-print-later embodiment of the invention. In the
embodiment of FIG. 27, both the low-resolution and
high-resolution DigiPaper representations of the document are
stored on a single server 2705. Server 2705 is located
remotely from user 2790 and is trusted by the copyright
holder or holders. Server 2705 provides display service 2706
and print service 2707.

Display service 2706 provides the low-resolution DigiPaper
representation via insecure channels 2711 through a
network 2710 (for example, the Internet or a corporate
intranet) to a client computer 2720 that runs a Web browser
software program. Client 2720 is untrusted; that is, the
person using client 2720 (here, user 2790) is someone who
cannot be relied upon to refrain from making unauthorized
copies of the document. Similarly, insecure channels 2711
are susceptible to interception by unauthorized parties. Client
2720 runs Web browser software which produces
low-resolution displayed output 2725 that can be viewed on-line
2729 by user 2790.

Display service 2706 can also accept hardcopy requests
from client 2720 and communicate these, securely, to print
service 2707. Upon receiving such a request, in this embodiment
print service 2707 provides a high-resolution DigiPaper
representation via a secure dedicated communications
channel 2731 to a printer 2730. Printer 2730 is trusted; that
is, the person or facility that operates printer 2730 is
someone who is trusted not to make unauthorized copies,
and is further trusted to handle the financial accounting
associated with providing printed copies to users. Printer
2730 generates a high-quality, high-resolution printed output
2735 and also generates an invoice or the like as indicated
at 2733. Printed output 2735 is physically delivered 2739 to
user 2790, who can then read it. Delivery 2739 can be made,
for example, by mail, air or ground transport, or the like.
Alternatively, user 2790 can pick up printed output 2735
from the facility where printer 2730 is kept. Typically, to
improve user convenience while at the same time ensuring
document security, printer 2730 is located at a print service
bureau, copy shop, or other facility located relatively near to
user 2790 but not on the premises of user 2790, or at least
not accessible to user 2790 without proper authorization.
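
The division of labor in the FIG. 27 embodiment can be summarized with a small Python sketch. All class and method names here (StoredDocument, TrustedPrinter, Server, display_service, request_hardcopy) are hypothetical; the sketch models the channels only by which party receives which bytes, and does not implement encryption, payment, or real printing.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoredDocument:
    low_res: bytes   # for example, a 72 dpi tokenized representation
    high_res: bytes  # for example, a 600 dpi tokenized representation


@dataclass
class TrustedPrinter:
    """Stands in for printer 2730: receives high-res data, emits only paper."""
    invoices: List[str] = field(default_factory=list)

    def print_hardcopy(self, user: str, high_res: bytes) -> str:
        self.invoices.append(user)  # corresponds to invoice 2733
        return "printed output for %s (%d bytes rendered)" % (user, len(high_res))


class Server:
    """Stands in for server 2705 with display service 2706 and print service 2707."""

    def __init__(self, documents: Dict[str, StoredDocument], printer: TrustedPrinter):
        self._documents = documents
        self._printer = printer

    def display_service(self, doc_id: str) -> bytes:
        # Sent over the insecure channel: low resolution only.
        return self._documents[doc_id].low_res

    def request_hardcopy(self, user: str, doc_id: str) -> str:
        # The high-res bytes go only to the trusted printer, never to the caller.
        high_res = self._documents[doc_id].high_res
        return self._printer.print_hardcopy(user, high_res)
```

The point of the design is visible in the return types: the untrusted client can obtain low-resolution bytes or a description of a physical printout, but no code path hands it the high-resolution representation.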

FIG. 28 illustrates a more complex example of decoding
in a browse-now-print-later embodiment of the invention. In
the embodiment of FIG. 28, the low-resolution DigiPaper
representation of the document is stored on a display server
2806, and one or more high-resolution DigiPaper representations
are stored on a print server 2807. Display server 2806
communicates with untrusted clients 2820a, 2820b, 2820c
via insecure channels 2811 across network 2810, and with
print server 2807 via secure channels 2831, which can run
directly between display server 2806 and print server 2807
or can go through network 2810 as the case may be. Display
server 2806 can transmit the low-resolution DigiPaper
representation of the document to clients 2820a, 2820b, 2820c,
which can then display it to their respective users (not
shown) as low-resolution display outputs 2825a, 2825b,
2825c.

Display server 2806 also can receive requests for hardcopy
output from clients 2820a, 2820b, 2820c. It forwards
these requests via secure channel 2831 to print server 2807.
Preferably the request to print server 2807 is communicated
over a secure channel, rather than an insecure channel, from
display server 2806, to prevent interlopers from requesting
printed copies without proper authorization.

Upon receiving a request for hardcopy output from display
server 2806, print server 2807 notifies an accounting
server 2833 so that invoicing can proceed, and transmits the
high-resolution DigiPaper representation of the document
via secure channels 2831 across network 2810 to trusted
printer 2830, which produces therefrom a high-resolution
printed output 2839.

FIG. 28 also shows an unauthorized eavesdropper 2890
intercepting the communications from display server 2806
to clients 2820a, 2820b, 2820c. Such interception is possible
because channels 2811 are insecure. However, eavesdropper
2890 can only intercept the low-resolution representation of
the document. The high-resolution representation, sent to
and from print server 2807 via secure channels 2831, is
inaccessible.

Several points common to the embodiments of FIGS. 27
and 28 are worth noting. In each embodiment, both of the
servers or services are trusted by the copyright holder or
other rightful possessor of the document, and the links
between them are secure. Likewise, the output printer is
trusted. Neither the server(s) nor the printer is accessible to
unauthorized users. Clients and users are considered to be
untrusted and untrustworthy, and communications with them
can therefore be insecure.

There exist known file formats that store a single document
as a set of several representations at several different
resolutions. However, the use of multiple representations of
a source document at multiple resolutions to facilitate access
control, especially in the context of the Web, is new to the
present invention.

Use of browse-now-print-later with alternate tokenized 
encoding formats, multiple print resolutions (for example, 
offering users a range of printed outputs at resolutions such 
as 300 dpi, 600 dpi, 1200 dpi, etc.), caching of the high- 
resolution document representation at the trusted printer to 
avoid unnecessary retransmission, and other extensions and 
modifications will be apparent to those of skill in the art. 

The foregoing description illustrates just some of the uses 
and embodiments of the invention, and many others are 
possible. Accordingly, the scope of the invention is not 
limited by the description, but instead is given by the 
appended claims and their full range of equivalents. 

TABLE 1

Encoding for 32-bit values

Value               Encoding
0 ... 127           0 + value encoded as 7 bits
128 ... 1151        10 + (value - 128) encoded as 10 bits
1152 ... 32767      11 + value encoded as 15 bits
32768 ... ∞         1100000 + value encoded as 32 bits
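
Table 1 can be read as a prefix code in which "+" denotes concatenation of a fixed bit prefix with a fixed-width binary field. The following Python sketch works under that reading, with two further assumptions: fields are written most significant bit first, and the second row covers 128 ... 1151 with an offset of 128. The function names encode_table1 and decode_table1 are hypothetical, and bits are held in a text string for clarity rather than packed into bytes.

```python
from typing import Tuple


def encode_table1(value: int) -> str:
    """Encode a non-negative 32-bit value as a string of '0'/'1' characters."""
    if value < 0 or value >= 2 ** 32:
        raise ValueError("value must fit in 32 bits")
    if value <= 127:
        return "0" + format(value, "07b")
    if value <= 1151:
        return "10" + format(value - 128, "010b")
    if value <= 32767:
        return "11" + format(value, "015b")
    return "1100000" + format(value, "032b")


def decode_table1(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one value starting at pos; return (value, next position)."""
    if bits[pos] == "0":
        return int(bits[pos + 1:pos + 8], 2), pos + 8
    if bits[pos + 1] == "0":
        return int(bits[pos + 2:pos + 12], 2) + 128, pos + 12
    # After "11", a 15-bit field for a value of 1152 or more never starts
    # with five zero bits, so "00000" unambiguously marks the 32-bit form.
    if bits[pos + 2:pos + 7] == "00000":
        return int(bits[pos + 7:pos + 39], 2), pos + 39
    return int(bits[pos + 2:pos + 17], 2), pos + 17
```

The third and fourth rows share the prefix 11 yet remain distinguishable, because the 15-bit field is stored without an offset and values below 1152 are never encoded in that row.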


TABLE 2

Width/height encoding table

Value               Encoding
0                   0
1                   10
2                   110
3 ... 9             1110 + (value - 3) encoded as 3 bits
10 ... ∞            1111 + (value - 10) encoded as in Table 1
END                 1110111
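
Table 2 adds an END marker that reuses the one 3-bit pattern (111) left over by the 3 ... 9 row. The Python sketch below encodes a list of values followed by END. It assumes the codes for 1, 2, and 10 or more read 10, 110, and 1111; the names encode_dimension and encode_dimension_list are hypothetical, and what the terminated list represents is defined elsewhere in the format description. The Table 1 encoder is repeated so that the block stands alone.

```python
from typing import Iterable


def encode_table1(value: int) -> str:
    """Table 1 encoding, assuming MSB-first fields (see Table 1)."""
    if value < 0 or value >= 2 ** 32:
        raise ValueError("value must fit in 32 bits")
    if value <= 127:
        return "0" + format(value, "07b")
    if value <= 1151:
        return "10" + format(value - 128, "010b")
    if value <= 32767:
        return "11" + format(value, "015b")
    return "1100000" + format(value, "032b")


END = "1110111"  # prefix 1110 followed by the otherwise unused field 111


def encode_dimension(value: int) -> str:
    """Encode one width or height value according to Table 2."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return "0"
    if value == 1:
        return "10"
    if value == 2:
        return "110"
    if value <= 9:
        return "1110" + format(value - 3, "03b")
    return "1111" + encode_table1(value - 10)


def encode_dimension_list(values: Iterable[int]) -> str:
    """Encode a sequence of values and terminate it with END."""
    return "".join(encode_dimension(v) for v in values) + END
```

Small values get the shortest codes, which suits data dominated by zeros and ones, while arbitrarily large values remain representable through the Table 1 escape.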


TABLE 3

Use count encoding table 0

Value               Encoding
0                   100
1                   0
2                   101
3 ... 34            110 + (value - 3) encoded as 5 bits
35 ... ∞            111 + (value - 35) encoded as in Table 1


TABLE 4

Use count encoding table 1

Value               Encoding
2                   0
3                   100
4                   1010
5 ... 6             1011 + (value - 5) encoded as 1 bit
7 ... 10            1100 + (value - 7) encoded as 2 bits
11 ... 14           11110 + (value - 11) encoded as 2 bits
15 ... 30           1101 + (value - 15) encoded as 4 bits
31 ... 94           1110 + (value - 31) encoded as 6 bits
95 ... ∞            11111 + (value - 95) encoded as in Table 1
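
One way to sanity-check a prefix table such as Table 4 is to confirm that its prefixes are prefix-free and that their Kraft sum equals one, meaning no bit pattern is wasted or ambiguous. The short Python sketch below does this for the Table 4 prefixes, assuming the prefix for the 11 ... 14 row reads 11110. The helper names kraft_sum and is_prefix_free are hypothetical.

```python
from fractions import Fraction
from typing import List

# Prefixes of Table 4 in row order (payload fields excluded).
TABLE4_PREFIXES: List[str] = [
    "0", "100", "1010", "1011", "1100", "11110", "1101", "1110", "11111",
]


def kraft_sum(prefixes: List[str]) -> Fraction:
    """Sum of 2^-length over all prefixes; 1 means the code is complete."""
    return sum((Fraction(1, 2 ** len(p)) for p in prefixes), Fraction(0))


def is_prefix_free(prefixes: List[str]) -> bool:
    """True when no prefix is the start of another (or a duplicate)."""
    ordered = sorted(prefixes)
    # In sorted order, a prefix relation always shows up between neighbors.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))
```

A decoder can therefore identify the row of Table 4 from the leading bits alone, then read the fixed-width field that the row specifies.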


TABLE 5

First X encoding table 0

Value               Encoding
-∞ ... -2048        1110111110 + (-2048 - value) encoded as in Table 1
-2047 ... -1024     1010 + (value + 2047) encoded as 10 bits
-1023 ... -512      1011 + (value + 1023) encoded as 9 bits
-511 ... -256       1100 + (value + 511) encoded as 8 bits
-255 ... -128       1101 + (value + 255) encoded as 7 bits
-127 ... -64        11110 + (value + 127) encoded as 6 bits
-63 ... -32         11111 + (value + 63) encoded as 5 bits
-31 ... -1          1110 + (value + 31) encoded as 5 bits
0 ... 127           00 + value encoded as 7 bits
128 ... 255         010 + (value - 128) encoded as 7 bits
256 ... 511         011 + (value - 256) encoded as 8 bits
512 ... 1023        1000 + (value - 512) encoded as 9 bits
1024 ... 2047       1001 + (value - 1024) encoded as 10 bits
2048 ... ∞          1110111111 + (value - 2048) encoded as in Table 1
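
Table 5 extends the same scheme to signed values. In every fixed-width row the field equals the value minus the low end of the row's range, so the table can be driven from data. The Python sketch below assumes MSB-first fields and reads the 256 ... 511 prefix as 011; the names FIRST_X_TABLE_0 and encode_first_x are hypothetical, and the Table 1 encoder is repeated so that the block stands alone.

```python
from typing import List, Tuple


def encode_table1(value: int) -> str:
    """Table 1 encoding, assuming MSB-first fields (see Table 1)."""
    if value < 0 or value >= 2 ** 32:
        raise ValueError("value must fit in 32 bits")
    if value <= 127:
        return "0" + format(value, "07b")
    if value <= 1151:
        return "10" + format(value - 128, "010b")
    if value <= 32767:
        return "11" + format(value, "015b")
    return "1100000" + format(value, "032b")


# (low, high, prefix, field width); the field holds value - low.
FIRST_X_TABLE_0: List[Tuple[int, int, str, int]] = [
    (-2047, -1024, "1010", 10),
    (-1023, -512, "1011", 9),
    (-511, -256, "1100", 8),
    (-255, -128, "1101", 7),
    (-127, -64, "11110", 6),
    (-63, -32, "11111", 5),
    (-31, -1, "1110", 5),
    (0, 127, "00", 7),
    (128, 255, "010", 7),
    (256, 511, "011", 8),
    (512, 1023, "1000", 9),
    (1024, 2047, "1001", 10),
]


def encode_first_x(value: int) -> str:
    """Encode a signed first-X value according to Table 5."""
    if value <= -2048:
        return "1110111110" + encode_table1(-2048 - value)
    if value >= 2048:
        return "1110111111" + encode_table1(value - 2048)
    for low, high, prefix, width in FIRST_X_TABLE_0:
        if low <= value <= high:
            return prefix + format(value - low, "0%db" % width)
    raise AssertionError("unreachable: the rows cover -2047 ... 2047")
```

The two escape codes work because the -31 ... -1 row uses only 31 of its 32 five-bit patterns; the spare pattern 11111, extended by one more bit, selects the negative or positive overflow row.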


TABLE 6

First X encoding table 1

Value               Encoding
-∞ ... -1024        1011111110 + (-1024 - value) encoded as in Table 1
-1023 ... -512      000 + (value + 1023) encoded as 9 bits
-511 ... -256       001 + (value + 511) encoded as 8 bits
-255 ... -128       1010 + (value + 255) encoded as 7 bits
-127 ... -64        11100 + (value + 127) encoded as 6 bits
-63 ... -32         11101 + (value + 63) encoded as 5 bits
-31 ... -1          1011 + (value + 31) encoded as 5 bits
0 ... 31            1100 + value encoded as 5 bits
32 ... 63           11110 + (value - 32) encoded as 5 bits
64 ... 127          11111 + (value - 64) encoded as 6 bits
128 ... 255         1101 + (value - 128) encoded as 7 bits
256 ... 511         010 + (value - 256) encoded as 8 bits
512 ... 1023        011 + (value - 512) encoded as 9 bits
1024 ... 2047       100 + (value - 1024) encoded as 10 bits
2048 ... ∞          1011111111 + (value - 2048) encoded as in Table 1



TABLE 7

Delta X encoding table 0

Value               Encoding

-∞ ... -15          11111001110 + (-15 - value) encoded as in Table 1
-14 ... -8          1111100 + (value + 14) encoded as 3 bits
-7 ... -6           111111110 + (value + 7) encoded as 1 bit
-5 ... -4           11111110 + (value + 5) encoded as 1 bit
-3                  111111111
-2                  1111101
-1                  1010
0 ... 1             01 + value encoded as 1 bit
2                   11010
3                   111010
4 ... 19            100 + (value - 4) encoded as 4 bits
20 ... 21           111011 + (value - 20) encoded as 1 bit
22 ... 37           1011 + (value - 22) encoded as 4 bits
38 ... 69           1100 + (value - 38) encoded as 5 bits
70 ... 133          11011 + (value - 70) encoded as 6 bits
134 ... 261         11100 + (value - 134) encoded as 7 bits
262 ... 389         111100 + (value - 262) encoded as 7 bits
390 ... 645         1111110 + (value - 390) encoded as 8 bits
646 ... 1669        111101 + (value - 646) encoded as 10 bits
1670 ... ∞          11111001111 + (value - 1670) encoded as in Table 1
END                 00


TABLE 8

Delta X encoding table 1

Value               Encoding

-∞ ... -30          11111001110 + (-30 - value) encoded as in Table 1
-29 ... -16         1111100 + (value + 29) encoded as 4 bits
-15 ... -12         111111110 + (value + 15) encoded as 2 bits
-11 ... -8          11111110 + (value + 11) encoded as 2 bits
-7 ... -6           111111111 + (value + 7) encoded as 1 bit
-5 ... -4           1111101 + (value + 5) encoded as 1 bit
-3 ... -2           1010 + (value + 3) encoded as 1 bit
-1 ... 0            010 + (value + 1) encoded as 1 bit
1 ... 2             011 + (value - 1) encoded as 1 bit
3 ... 4             11010 + (value - 3) encoded as 1 bit
5 ... 6             111010 + (value - 5) encoded as 1 bit
7 ... 38            100 + (value - 7) encoded as 5 bits
39 ... 42           111011 + (value - 39) encoded as 2 bits
43 ... 74           1011 + (value - 43) encoded as 5 bits
75 ... 138          1100 + (value - 75) encoded as 6 bits
139 ... 266         11011 + (value - 139) encoded as 7 bits
267 ... 522         11100 + (value - 267) encoded as 8 bits
523 ... 778         111100 + (value - 523) encoded as 8 bits
779 ... 1290        1111110 + (value - 779) encoded as 9 bits
1291 ... 3338       111101 + (value - 1291) encoded as 11 bits
3339 ... ∞          11111001111 + (value - 3339) encoded as in Table 1
END                 00


TABLE 9

Delta X encoding table 2

Value               Encoding

-∞ ... -20          1101101110 + (-20 - value) encoded as in Table 1
-19 ... -6          110110 + (value + 19) encoded as 4 bits
-5                  11111110
-4                  111100
-3                  11000
-2 ... 1            01 + (value + 2) encoded as 2 bits
2                   11001
3                   110111
4                   1111101
5                   11111111
6 ... 69            10 + (value - 6) encoded as 6 bits
70 ... 101          11010 + (value - 70) encoded as 5 bits
102 ... 133         1110000 + (value - 102) encoded as 5 bits
134 ... 197         111001 + (value - 134) encoded as 6 bits
198 ... 325         111010 + (value - 198) encoded as 7 bits
326 ... 581         111011 + (value - 326) encoded as 8 bits
582 ... 1093        111100 + (value - 582) encoded as 9 bits
1094 ... 2117       111101 + (value - 1094) encoded as 10 bits
2118 ... 4165       1111110 + (value - 2118) encoded as 11 bits
4166 ... ∞          1101101111 + (value - 4166) encoded as in Table 1
END                 00
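
A decoder can read these variable-length codes only if no complete codeword is a prefix of another. The sketch below checks that property for Table 7 (Delta X encoding table 0). It is an illustration under stated assumptions, not part of the described method: the row data is transcribed from Table 7, the names `DELTA_X_TABLE_0`, `all_codewords`, and `is_prefix_free` are hypothetical, fields are assumed to be written most significant bit first, and the two escape prefixes and the END code are treated as codewords in their own right.

```python
from typing import List, Tuple

# (low, high, prefix, field_width): code = prefix + (value - low) in field_width bits.
DELTA_X_TABLE_0: List[Tuple[int, int, str, int]] = [
    (-14, -8, "1111100", 3),
    (-7, -6, "111111110", 1),
    (-5, -4, "11111110", 1),
    (-3, -3, "111111111", 0),
    (-2, -2, "1111101", 0),
    (-1, -1, "1010", 0),
    (0, 1, "01", 1),
    (2, 2, "11010", 0),
    (3, 3, "111010", 0),
    (4, 19, "100", 4),
    (20, 21, "111011", 1),
    (22, 37, "1011", 4),
    (38, 69, "1100", 5),
    (70, 133, "11011", 6),
    (134, 261, "11100", 7),
    (262, 389, "111100", 7),
    (390, 645, "1111110", 8),
    (646, 1669, "111101", 10),
]
# Escape prefixes (each followed by Table 1 coding) and the END marker.
SPECIAL_CODES = ["11111001110", "11111001111", "00"]


def all_codewords() -> List[str]:
    """Expand every row of the table into its complete codewords."""
    words = list(SPECIAL_CODES)
    for low, high, prefix, width in DELTA_X_TABLE_0:
        for value in range(low, high + 1):
            field = format(value - low, "0{}b".format(width)) if width else ""
            words.append(prefix + field)
    return words


def is_prefix_free(words: List[str]) -> bool:
    """True if no word equals or is a prefix of another word."""
    ordered = sorted(words)
    # After sorting, a word that prefixes any other word also prefixes
    # its immediate successor, so checking neighbours is sufficient.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))
```

With the rows as transcribed, `is_prefix_free(all_codewords())` returns `True`. The row for -14 ... -8 uses only field values 0 through 6, which leaves the field value 7 free for the two escape prefixes.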
TABLE 10

Delta Y encoding table 0

Value          Encoding

1              0
2 ... 3        10 + (value - 2) encoded as 1 bit
4              1100
5 ... 6        1101 + (value - 5) encoded as 1 bit
7 ... 8        11100 + (value - 7) encoded as 1 bit
9 ... 12       11101 + (value - 9) encoded as 2 bits
13 ... 16      111100 + (value - 13) encoded as 2 bits
17 ... 20      1111010 + (value - 17) encoded as 2 bits
21 ... 28      1111011 + (value - 21) encoded as 3 bits
29 ... 44      1111100 + (value - 29) encoded as 4 bits
45 ... 76      1111101 + (value - 45) encoded as 5 bits
77 ... 140     1111110 + (value - 77) encoded as 6 bits
141 ... ∞      1111111 + (value - 141) encoded as in Table 1
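
Decoding reverses the table: match a prefix, then read the fixed-width field that follows it. The sketch below does this for Table 10 (Delta Y encoding table 0). It is a hedged illustration, not the decoder described in this document: `DELTA_Y_TABLE_0` and `decode_delta_y_0` are hypothetical names, fields are assumed to be most significant bit first, and the escape row is left unimplemented because it depends on the Table 1 coding.

```python
from typing import List, Tuple

# (prefix, field_width, base): after matching the prefix, read field_width
# bits and add them to base to recover the value.
DELTA_Y_TABLE_0: List[Tuple[str, int, int]] = [
    ("0", 0, 1),
    ("10", 1, 2),
    ("1100", 0, 4),
    ("1101", 1, 5),
    ("11100", 1, 7),
    ("11101", 2, 9),
    ("111100", 2, 13),
    ("1111010", 2, 17),
    ("1111011", 3, 21),
    ("1111100", 4, 29),
    ("1111101", 5, 45),
    ("1111110", 6, 77),
]
ESCAPE = "1111111"  # values of 141 and above continue with Table 1 coding


def decode_delta_y_0(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one value starting at bits[pos]; return (value, next position)."""
    if bits.startswith(ESCAPE, pos):
        raise NotImplementedError("the escape row needs a Table 1 decoder")
    for prefix, width, base in DELTA_Y_TABLE_0:
        if bits.startswith(prefix, pos):
            start = pos + len(prefix)
            end = start + width
            if end > len(bits):
                raise ValueError("truncated bit string")
            offset = int(bits[start:end], 2) if width else 0
            return base + offset, end
    raise ValueError("no Table 10 prefix matches at position {}".format(pos))
```

Because the prefixes are prefix-free, the order in which the rows are tried does not matter. Calling the function repeatedly on `01011100`, feeding each returned position back in, yields the values 1, 3, and 4.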



TABLE 11

Delta Y encoding table 1

Value          Encoding

1              0
2              10
3 ... 4        110 + (value - 3) encoded as 1 bit
5              11100
6 ... 7        11101 + (value - 6) encoded as 1 bit
8 ... 9        111100 + (value - 8) encoded as 1 bit
10             1111010
11 ... 12      1111011 + (value - 11) encoded as 1 bit
13 ... 16      1111100 + (value - 13) encoded as 2 bits
17 ... 24      1111101 + (value - 17) encoded as 3 bits
25 ... 40      1111110 + (value - 25) encoded as 4 bits
41 ... 72      11111110 + (value - 41) encoded as 5 bits
73 ... ∞       11111111 + (value - 73) encoded as in Table 1



TABLE 12

Delta Y encoding table 2

Value          Encoding

1              0
2              100
3              1100
4              11100
5 ... 6        1101 + (value - 5) encoded as 1 bit
7 ... 13       101 + (value - 7) encoded as 3 bits
14 ... 15      111010 + (value - 14) encoded as 1 bit
16 ... 19      111011 + (value - 16) encoded as 2 bits
20 ... 27      111100 + (value - 20) encoded as 3 bits
28 ... 43      111101 + (value - 28) encoded as 4 bits
44 ... 75      111110 + (value - 44) encoded as 5 bits
76 ... 139     111111 + (value - 76) encoded as 6 bits
140 ... ∞      101111 + (value - 140) encoded as in Table 1
The claimed invention is:

1. A method comprising the steps of: 

providing a processor with a first set of digital information 
comprising a first structured representation of a 
document, the first structured representation being a 
resolution-independent representation, a plurality of 
image collections being obtainable from the first struc- 
tured representation, each such obtainable image col- 
lection comprising at least one image, each image in 
each such collection being an image of at least a portion 
of the document, each image in each such collection 
having a characteristic resolution; 

with a processor, generating a first bitmap representation 
of the document with the first structured representation,
the first bitmap representation comprising an image
collection including at least one image, each image in 
the image collection comprised by the first bitmap 
representation being an image of at least a page of the 
document; each image in the first bitmap representation 
having a first characteristic resolution; 
with a processor, producing from the first bitmap repre- 
sentation a second set of digital information comprising 
a second structured representation of the document, the 
second structured representation being a lossless rep- 
resentation of a low-resolution image collection, the 
second structured representation including a plurality 
of low-resolution tokens and a plurality of positions, 
the second set of digital information being produced by 
extracting the plurality of low-resolution tokens from 
the first bitmap representation, each low-resolution 
token comprising a set of pixel data representing a 
subimage of the low-resolution image collection, and 
determining from the first bitmap representation the 
plurality of positions of the second structured 
representation, each position of the second structured 
representation being a position of a low-resolution 
subimage in the low-resolution image collection, 
each low-resolution subimage being one of the sub- 
images from one of the plurality of low-resolution 
tokens, at least one low-resolution subimage having 
a plurality of pixels and occurring at more than one 
position in the image collection; 
with a processor, generating a second bitmap representa- 
tion of the document with the first structured 
representation, the second bitmap representation com- 
prising an image collection including at least one 
image, each image in the image collection comprised 
by the second bitmap representation being an image of 
at least a page of the document; each image in the 
second bitmap representation having a second charac- 
teristic resolution; the second characteristic resolution 
being greater than the first characteristic resolution; 
with a processor, producing from the second bitmap 
representation a third set of digital information com- 
prising a third structured representation of the 
document, the third structured representation being a 
lossless representation of a high-resolution image 
collection, the third structured representation including 
a plurality of high-resolution tokens and a plurality of 
positions, the third set of digital information being 
produced by 

extracting the plurality of high-resolution tokens from 
the second bitmap representation, each high- 
resolution token comprising a set of pixel data rep- 
resenting a subimage of the high-resolution image 
collection, and 
determining from the second bitmap representation the 
plurality of positions of the third structured 
representation, each position of the third structured 
representation being a position of a high-resolution 
subimage in the high-resolution image collection, 
each high-resolution subimage being one of the 
subimages from one of the plurality of high-resolution
tokens, at least one high-resolution subimage
having a plurality of pixels and occurring at
more than one position in the image collection; and 
making the second and third sets of digital information
thus produced available for further use.

2. The method of claim 1 wherein the providing step
comprises providing the processor with a first structured 
representation selected from the group consisting of a page 
description language representation, a document exchange
format representation, a print control language 
representation, and a markup language representation. 

3. The method of claim 1 wherein the providing step 
comprises providing the processor with a first structured
representation that is an original representation of the 
document, the original representation being a representation 
generated by a computer program wherein the document is 
created. 

4. The method of claim 1 wherein the providing step 
comprises providing the processor with a font-based first 
structured representation of the document, and wherein the 
producing step comprises producing a fontless second struc- 
tured representation of the document. 

5. The method of claim 1 wherein the step of making the 
second and third sets of digital information available for
further use comprises: 

producing from the second set of digital information a 
representation of the document at the first characteristic 
resolution in a first medium; and 

producing from the third set of digital information a
representation of the document at the second charac- 
teristic resolution in a second medium. 

6. The method of claim 5 wherein: 

the step of producing the representation of the document 
at the first characteristic resolution comprises displaying
the document at the first characteristic resolution
with a visual display; and 

the step of producing the representation of the document 
at the second characteristic resolution comprises printing
the document at the second characteristic resolution
with a printer. 

7. The method of claim 1 wherein the step of making the 
second and third sets of digital information available for 
further use comprises communicating the second set of 
digital information in an insecure manner to an untrusted
recipient. 

8. The method of claim 1 wherein the step of making the 
second and third sets of digital information available for 
further use comprises communicating the third set of digital 
information in a secure manner to a trusted recipient. 

9. The method of claim 1 wherein the step of making the
second and third sets of digital information available for 
further use comprises displaying at least a portion of the 
second structured representation with a display device. 

10. The method of claim 1 wherein the step of making the 
second and third sets of digital information available for 
further use comprises printing at least a portion of the third 
structured representation with a printing device. 

11. The method of claim 1 wherein the step of making the
second and third sets of digital information available for
further use comprises:

communicating the second set of digital information to a 
processor providing a display service; 

storing the second set of digital information thus communicated
so as to be accessible to the display service;

communicating the third set of digital information in a 
secure manner to a processor providing a print service; 
and 

storing the third set of digital information thus communicated
with the print service.

12. The method of claim 11 and further comprising the
steps of: 

with the display service, communicating at least a portion 
of the second structured representation to a recipient 
processor without communicating any portion of the
third structured representation to the recipient processor;

receiving in the display service a request from the recipient
processor to provide an output derived from at least
a portion of the third structured representation; 

communicating said request in a secure manner from the 
display service to the print service; 

responsively to the request, communicating at least a 
portion of the third structured representation in a secure 
manner from the print service to a trusted output 
facility; and 

producing at the trusted output facility from the third 
structured representation an output comprising a 
human-readable representation of at least a portion of 
the document. 

13. The method of claim 12 wherein the final producing 
step comprises printing with a printer of the trusted facility 
a printed representation of at least a portion of the document, 
the printed representation having a characteristic resolution 
equal to the second characteristic resolution. 

14. The method of claim 1 further comprising the steps of: 
transmitting the second set of digital information in an
insecure manner to an untrusted recipient; and
transmitting the third set of digital information in a secure 
manner to a trusted recipient. 

15. The method of claim 1 further comprising the steps of: 
producing from the second set of digital information a
fourth set of digital information comprising an image
collection including at least one image, each image in 
the image collection comprised by the fourth set of 
digital information being an image of at least a portion 
of the document, each image in the image collection 
comprised by the fourth set of digital information being 
constructed of token subimages and including a token 
subimage in at least one position in the image collec- 
tion; and 

making the fourth set of digital information thus produced 
available for further use. 

16. An article of manufacture comprising an information 
storage medium wherein is stored information comprising a 
computer program for facilitating production by a processor 
of a second and third sets of digital information from a first 
bitmap representation and a second bitmap representation, 
respectively, the first bitmap representation and the second 
bitmap representation being generated from a first set of 
digital information, 

the first set of digital information comprising a first 
structured representation of a document, the first struc- 
tured representation being a resolution-independent 
representation, a plurality of image collections being 
obtainable from the first structured representation, each 
such obtainable image collection comprising at least 
one image, each image in each such collection being an 
image of at least a portion of the document, each image 
in each such collection having a characteristic 
resolution, 

the first bitmap representation comprising an image col- 
lection including at least one image, each image in the 
image collection comprised by the first bitmap repre- 
sentation being an image of at least a page of the 
document; each image in the first bitmap representation 
having a first characteristic resolution; 

the second set of digital information comprising a second 
structured representation of the document, the second 
structured representation being a lossless representa- 
tion of a low-resolution image collection, the second 
structured representation including a plurality of low- 
resolution tokens and a plurality of positions, 
each low-resolution token comprising a set of pixel
data representing a subimage of the low-resolution 
image collection, 
each position of the second structured representation 
being a position of a subimage in the low-resolution
image collection, each low-resolution subimage 
being one of the subimages from one of the plurality 
of low-resolution tokens, at least one low-resolution 
subimage having a plurality of pixels and occurring 
at more than one position in the image collection,
the second bitmap representation comprising an image 
collection including at least one image, each image in 
the image collection comprised by the second bitmap 
representation being an image of at least a page of the 
document; each image in the second bitmap representation
having a second characteristic resolution; the
second characteristic resolution being greater than the 
first characteristic resolution; 
the third set of digital information comprising a third 
structured representation of the document, the third
structured representation being a lossless representa- 
tion of a high-resolution image collection, the third 
structured representation including a plurality of high- 
resolution tokens and a plurality of positions, 
each high-resolution token comprising a set of pixel
data representing a subimage of the high-resolution
image collection, 
each position of the third structured representation 
being a position of a high-resolution subimage in the 
high-resolution image collection, each high-resolution
subimage being one of the subimages
from one of the plurality of high-resolution tokens, at 
least one high-resolution subimage having a plurality 
of pixels and occurring at more than one position in 
the image collection.

17. Apparatus comprising:
a processor; 

an instruction store, coupled to the processor; the instruction
store including document processing instructions
for execution by the processor; the processor, in execut- 
ing the image processing instructions: 
receiving a first set of digital information comprising a 
first structured representation of a document, the first 
structured representation being a resolution-independent
representation, a plurality of image collections
being obtainable from the first structured
representation, each such obtainable image collec- 
tion comprising at least one image, each image in 
each such collection being an image of at least a
portion of the document, each image in each such 
collection having a characteristic resolution; 
generating a first bitmap representation of the docu- 
ment with the first structured representation, the first 
bitmap representation comprising an image collection
including at least one image, each image in the
image collection comprised by the first bitmap rep- 
resentation being an image of at least a page of the 
document; each image in the first bitmap representation
having a first characteristic resolution;
producing from the first bitmap representation a second 
set of digital information comprising a second structured
representation of the document, the second
structured representation being a lossless represen- 
tation of a low-resolution image collection, the sec- 
ond structured representation including a plurality of 
low-resolution tokens and a plurality of positions, 
the second set of digital information being produced 
by 

extracting the plurality of low-resolution tokens 
from the first bitmap representation, each low- 
resolution token comprising a set of pixel data 
representing a subimage of the low-resolution 
image collection, and 
determining from the first bitmap representation the 
plurality of positions of the second structured 
representation, each position of the second struc- 
tured representation being a position of a subim- 
age in the low-resolution image collection, each 
low-resolution subimage being one of the subim- 
ages from one of the plurality of low-resolution 
tokens, at least one low-resolution subimage hav- 
ing a plurality of pixels and occurring at more than 
one position in the image collection; 
generating a second bitmap representation of the docu- 
ment with the first structured representation, the 
second bitmap representation comprising an image 
collection including at least one image, each image 
in the image collection comprised by the second 
bitmap representation being an image of at least a 
page of the document; each image in the second 
bitmap representation having a second characteristic 
resolution; the second characteristic resolution being 
greater than the first characteristic resolution; 
producing from the second bitmap representation a 
third set of digital information comprising a third 
structured representation of the document, the third 
structured representation being a lossless represen- 
tation of a high-resolution image collection, the third 
structured representation including a plurality of 
high-resolution tokens and a plurality of positions, 
the third set of digital information being produced by 
extracting the plurality of high-resolution tokens 
from the second bitmap representation, each high- 
resolution token comprising a set of pixel data 
representing a subimage of the high-resolution 
image collection, and 
determining from the second bitmap representation 
the plurality of positions of the third structured 
representation, each position of the third struc- 
tured representation being a position of a high- 
resolution subimage in the high-resolution image 
collection, each high-resolution subimage being 
one of the subimages from one of the plurality of 
high-resolution tokens, at least one high- 
resolution subimage having a plurality of pixels 
and occurring at more than one position in the 
image collection; and 
a data store, coupled to the processor, wherein the 
sets of digital information can be stored. 

***** 